System and Method for Recommending Semantically Relevant Content

ABSTRACT

A property vector derived from extractable measurable properties of a data file is mapped to semantic properties for that data file. The property vector is an output from a trained artificial neural network that, following pairwise training of the ANN using pairs of files that map pairwise similarity/dissimilarity in property space towards corresponding pairwise semantic similarity/dissimilarity in semantic space, both preserves and is representative of semantic properties of the data file. The system and method assesses, based on comparisons between generated property vectors, ranks and then recommends and/or filters semantically close or semantically disparate candidate files in a database from a query from a user that includes the data file. Applications of the categorization and recommendation system and method apply to media or search tools and social media platforms, including media in the form of music, video, images data and/or text files.

BACKGROUND TO THE INVENTION

This invention relates, in general, to artificial intelligence and neural networks that generate a property-based file vector derived from properties extracted from a data file and where the property-based file vector both preserves and is representative of semantic properties of the data file. More particularly, the present invention relates to a system and method for assessing, ranking and then recommending and/or filtering semantically close or semantically disparate candidate files in a database in response to a query from a user, especially but not exclusively to a query in an internet-based platform such as a social media platform or search engine. More especially, applications of the categorization and recommendation system and method apply to media in the form of music, video and/or images data files, although the application is widely applicable and finds application in evaluation of speech and text data files, including personal reports.

SUMMARY OF THE PRIOR ART

One of the most challenging long-term objectives for artificial intelligence “AI”, typically based on an artificial neural network architecture, is to replicate human intellectual behaviour. This is a complex proposition not least because human opinion is based on subjective responses to stimuli and existing approaches in AI do not correlate well with emotional perspective responses. Rather, the rationale for the computing architecture in AI is implication of a most likely response based on assimilation of large quantities of data that have objectively discernible properties.

Refinement, i.e. training, of a deep neural network “DNN” is frequently based on the concept of “backpropagation”, i.e. the backward propagation of errors, to calculate a gradient that is needed in the DNN's calculation of the weights to be used in the network, as will be understood. The DNN therefore moves through its layers, calculating the probability of each output in an attempt to find the correct mathematical manipulation that turns the input into the output irrespective of whether it be a linear relationship or a non-linear relationship.

As a practical example of the current limited approach in AI to music interpretation, identification of perceived similarity between different pieces of music is constrained to absolute similarities rather than being related to semantically-perceived similarities. This may, at first inspection, not appear problematic, but on an intellectual and real footing a fundamental problem remains because “there is no such thing as music, except as created, perceived, and experienced in the human mind”. In other words, “Music, in its own right, does not exist . . . because neither music nor language can be studied as pure surface forms because the cognition of both produces information which is not contained in the surface form”. This proposition is established in the paper “On the non-existence of music: why music theory is a figment of the imagination” by Geraint A. Wiggins et al in ESCOM European Society for the Cognitive Sciences of Music, Musicæ Scientiæ, Discussion Forum 5, 2010, pages 231-255.

Hence, existing AI modelling that, from its outset, is based on a degree of absoluteness (based on the interpretation of measured parameters) is fatally flawed with the consequence that it will generate, in the exemplary context of a musical search tool, inconsistent and/or spurious results.

The same problems exist with the identification and categorization of other forms of expression, such as paintings or photographs or indeed interpretations of imagery, such as medical CT scans, or other purely descriptive expressions (such as a description of a smell, a medical report or an outline of a plot in a work of fiction) to locate and assess, relative to a defined start point (e.g. a particular description of a fragrance or the tonality, rhythm and timbre of a musical composition), the relevance of searchable electronic images and/or data that are either entirely unrelated or otherwise are potentially relevant to one another from the perspective of having an acceptably close set of subjective attributes, qualities or characteristics.

In fact, existing AI systems cannot resolve semantically-relevant attributes and therefore can both overlook semantic similarities whilst accepting or suggesting that perceptually-distinct dissimilarities are closely related.

The music, film and gaming industry, and particularly aspects relating to the provision of content, is evolving. In this respect, the sale or distribution of (for example) music or soundtracks as either streamed or downloaded digital files is becoming dominant in those markets. This contrasts with the sale of compact disc and DVD technologies (or, historically, vinyl disks) through established, but now waning, custom retail outlets.

Whilst music sales are commercial and content perceptual and aesthetic in nature, there is no existing, straightforward and reliable mechanism to locate tracks that share common musical characteristics honed to an individual's specific tastes. To qualify this statement, music is broadly categorised in terms of its genre, e.g. jazz, rock, classical and blues to name but a few, but within each of these genres there usually exist vast numbers of sub-categories or sub-species. For example, there are apparently at least thirty different forms of jazz, including free-funk, crossover, hard bop and swing. These sub-species may share some overarching similarities in user-discernible compositional architectures that define the genus, but frequently there are also significant dissimilarities that are sufficiently audibly or musically pronounced. To provide further context, two different species of jazz may perceptually be so profoundly audibly different that a particular listener concludes that one is likeable whereas the other is not. By the same (but reverse) token, a listener may prematurely disregard (or simply just not be aware of) a piece of classical music based on a flawed perception that its listening characteristics [in musical space and in a musical sense] should be disparate to those of a piece of hard rock when, in fact, these two different audio tracks are substantially identical in terms of their closeness in musical space.

With online music libraries each typically containing millions of songs (the iTunes and Tidal® music libraries allegedly each contain around fifty million tracks), the problem exists about how these databases can be effectively searched to identify user-perceived common musical themes, traits or features between myriad tracks potentially spanning entirely different genres. Consequently, a search for similar music could (and, to date, indeed frequently does) discount entire genres [or at least sub-species of a genre] from consideration and/or fail to associate together extremely relevant musical content in different tracks from different genres. Commercial libraries can make use of “collaborative filtering” in which recommendations are made based on the playlists of other users who have listened to the same song, but this approach depends heavily on stored user data and statistical usage. Collaborative filtering can reflect the personal preferences of a listener/user of the library, but it is limited by the amount of user data available and so is not in itself a complete solution.

There is also the issue of “cold start” which arises when a new (in the sense of an unknown or little known) artist [i.e. a novice, newcomer or “newbie” potentially signed by a recording studio or label] releases their first audio track or first album. The problem is that the artist is unknown and therefore has no effective following either on-line or elsewhere, such as acquired listeners from promotion over the radio aether or television. Expressing this differently, the lack of a listening history provides a roadblock both to making recommendations, such as through collaborative filtering, and to establishing a reputation and following for the newbie. The problems for the distributor, e.g. a record label, are how do they raise awareness of the new artist, how do they categorize the nature [which arguably is variable since it is user-perceivable] of the new artist's music and, in fact, how do they link/insert the music into an existing music library so that it is listened to, downloaded or streamed to ensure maximum exposure for commercialization reasons? The problem for the listening and/or streaming public or radio stations is that, in the context of these newbies, ‘they don't know what they don't know’ so the probability of randomly finding the newbie's initial foray into the world of music is slim and based more on luck than judgement.

For the distributor, effective exposure of and access to the artist's musical tracks equates to an increased likelihood of sales. Indeed, from a commercial perspective, it is also desirable to avoid a “slow burn” and therefore rapidly to grow the reputation of a new artist.

In short, the new artist must break into the market with an unproven and new product. In contrast, fans of existing artists will invariably follow, i.e. both monitor and generally be inclined to purchase, newly-released music from those existing artists irrespective of whether such newly-released music is good or bad. Indeed, even with poor critical acclaim, newly-released music from a popular artist will be streamed, listened to and/or purchased so the “cold start” problem does not exist for existing artists with an established following and listener base. The cold-start problem therefore stifles dissemination of music and also the potential evolution of new forms of music.

In addition, the nature of user perception and musical appreciation is a rapidly employed personal trait. Particularly, a listener will make an assessment about whether a track is palatable and preferably to their individual taste within a few seconds of the track (or a section thereof) being played/heard. Consequently, any track finding recommendation scheme, realised for example as a downloadable app, must be intrinsically quick (in terms of identifying a recommendation) and also reliable in that any recommendation it makes needs to satisfy user-perceived musical values, i.e. personal musical tastes. Any track finding recommendation tool that throws up seemingly random tracks, such as those of existing systems that make use of statistical analysis of demographic data by other users with identified common interests or circumstances (e.g. age range 30-40, married with two children, working as an accountant and living in a mortgaged property in Staten Island, N.Y.), is ultimately poor and its use disregarded or discounted. Perceptual categorization of musicologically-similar audio tracks, irrespective of genre, is therefore an important consideration for effective audio track finding technologies.

The problems identified above are not addressed by existing apps such as Shazam and SoundHound® since these apps focus on identification of an audio track that is sampled in real-time or otherwise these apps list tracks that others in the community are discovering. With SoundHound®, a song can be sung or hummed to try to identify it. These apps therefore identify the track being played/sampled or, based on reported hard numbers, they may make a recommendation for potential further listening that, frequently, is not overly relevant. These existing apps provide no perception of musicological similarities across myriad tracks in a music library.

Another of the issues faced by the music industry is how best to augment the listener/user experience, especially on a personal/individual level. Indeed, it has long been recognized that the contextual relevance of or relationship between a piece of music and an event brings about recognition or induces a complementary emotional response, e.g. a feeling of dread or suspense during a film or a product association arising in TV advertising. Identification of common musical traits is desirable because it has been recognized that appropriate use of musical content supports emotional, physiological and/or psychological engagement of the listener and therefore promotes the listener's sensory experience. This is, for example, relevant to game developers and/or advert or film trailer producers/editors who are tasked with rapidly compiling a suitable multimedia product that aligns relevant music themes, such as increasing musical intensity (in the context of an increasing sense of developing drama and urgency and not necessarily in the context of an absolute audio power output level) with video output. In providing at least one resultant “proof” for review, the developer or editor has already expended considerable time in identifying potentially suitable music and then fitting/aligning the selected music to the video. To delay having to identify a commercially-usable audio track, content developers presently may make use of so-called “temp tracks” that are often well-known tracks having rights that cannot be easily obtained, but this is just a stop-gap measure because a search is then required to identify a suitable commercially-viable track for which use rights can be obtained. Further time delays then arise from the instructing client having to assess whether the edit fits with their original brief. Therefore, an effective track searching tool would facilitate identification of a selection of alternative musical tracks for alignment with, for example, a visual sequence or the building of a musical program (such as occurs within “spin” classes that choreograph cycling exercise to music to promote work rates).

Technology does exist on the web to search for images having identical or similar visual characteristics, including identifying websites that present such identical or related images. For example, Google® supports a computer program application [sometimes foreshortened to the term “app”] called “Reverse Image Search” (see https://support.google.com/websearch/answer/1325808?h1=en) in which an uploaded image is apparently broken down into groups of constituent bits, at a server, and those groups of bits searched to identify related images according to some form of logical distance measure within a defined parameter space. Identified related images are then provided to the user who made use of the app and who uploaded the original image.

Whilst image comparison requires complex computations (typically based on a neural network), it is observed that the fundamental source document can be broken down into shapes, colour(s) and/or dimensions, such as angles or lengths. Contrasting of one or more of these factors allows for association to be established, e.g. through relative scaling. In contrast, a critique of musical characteristics, although again making use of a neural network, has to date been generally hampered by the difficulties in resolving perceptually more subtle differences in musical structures.

SUMMARY OF THE INVENTION

According to a first aspect of the invention there is provided a method of evaluating semantic closeness of a source data file relative to at least some of a plurality of candidate data files in a database and, in response to the evaluation, generating a list identifying at least one semantically close candidate data file from the database, the method comprising: processing the source file to extract properties therefrom; calculating a file vector in property space from said extracted properties, wherein the file vector both preserves and is representative of semantic properties of content of the source data file; comparing the file vector to a plurality of property vectors, wherein each of said at least some of the plurality of candidate data files has an associated property vector of said plurality of property vectors; determining a measured separation in continuous multi-dimensional property space between the file vector of the source file relative to respective property vectors of the at least some of the plurality of candidate data files; generating said list based on said measured separation and semantic closeness of content of the source data file; and providing the list as a recommendation.
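By way of illustration only, the comparing, determining and list-generating steps of the first aspect can be sketched as a nearest-neighbour ranking. The sketch assumes that file vectors and property vectors have already been produced by the trained network and that the measured separation is a Euclidean distance; the function name `recommend`, the track identifiers and the vector values are hypothetical and are not taken from the specification.

```python
import math

def recommend(file_vector, candidates, top_k=3):
    """Rank candidate files by measured separation from the source file vector.

    candidates maps a (hypothetical) file identifier to its property vector.
    Euclidean distance is an assumed choice of separation measure.
    """
    scored = [(cid, math.dist(file_vector, vec)) for cid, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest separation first
    return scored[:top_k]

# Hypothetical property vectors for three candidate files.
candidates = {
    "track_a": [0.10, 0.90, 0.30],
    "track_b": [0.80, 0.20, 0.50],
    "track_c": [0.15, 0.85, 0.35],
}
query = [0.12, 0.88, 0.32]
shortlist = recommend(query, candidates, top_k=2)
```

Here `shortlist` holds the two candidates with the smallest separation from the query, each paired with its distance, and this is the list that would be provided as the recommendation.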

The source data file may be compared against all of the candidate data files in the database.

The file vector and each property vector are each an output from a trained artificial neural network “ANN” that, following pairwise training of the ANN using pairs of training files, maps pairwise similarity/dissimilarity in property space towards corresponding pairwise semantic similarity/dissimilarity in semantic space to preserve semantic evaluation by valuing, on a pairwise basis, semantic perception reflected in quantified semantic dissimilarity distance measures over property assessment reflected by distance measures in property space.
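One possible reading of this pairwise training objective, offered purely as a sketch, is a loss that penalises any mismatch between the separation of two network outputs in property space and the quantified semantic dissimilarity of the same pair of training files. The squared-error form and the name `pairwise_loss` are assumptions made for illustration; the specification does not state the loss function at this point.

```python
import math

def pairwise_loss(embed_a, embed_b, semantic_distance):
    """Squared mismatch between property-space separation and semantic dissimilarity.

    embed_a and embed_b are network outputs for a pair of training files;
    semantic_distance is the quantified semantic dissimilarity for that pair.
    The squared-error form is an assumption, not the specified objective.
    """
    property_distance = math.dist(embed_a, embed_b)
    return (property_distance - semantic_distance) ** 2

# The pair sits 5.0 apart in property space.
matched = pairwise_loss([0.0, 0.0], [3.0, 4.0], semantic_distance=5.0)
mismatched = pairwise_loss([0.0, 0.0], [3.0, 4.0], semantic_distance=2.0)
```

Minimising such a loss over many pairs would pull the network outputs for semantically similar files together and push those for semantically dissimilar files apart, which is the behaviour the paragraph above describes.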

The database may include property vectors for candidate files cross-referenced to a descriptor or code identifying the content of each candidate file.

In an embodiment, the method includes preventing upload of the source data file when the descriptor or code indicates content in the source data file is inappropriate for circulation or publication.

The method may subsequently generate a report identifying a point of origin or user identity for the source file.
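The upload-screening embodiment can be pictured, under stated assumptions, as a check of the source file vector against stored property vectors whose descriptors mark restricted content. The descriptor value `"restricted"`, the threshold and the function name `screen_upload` are hypothetical, as is the use of a Euclidean distance threshold to decide that the source file matches a stored file.

```python
import math

# Hypothetical descriptor codes that mark content as unsuitable for publication.
BLOCKED_DESCRIPTORS = {"restricted"}

def screen_upload(file_vector, database, threshold):
    """Return False when the source file lies close to a flagged candidate.

    database is a list of (property_vector, descriptor) pairs; closeness is
    judged by an assumed Euclidean distance threshold.
    """
    for vector, descriptor in database:
        if descriptor in BLOCKED_DESCRIPTORS and math.dist(file_vector, vector) <= threshold:
            return False  # upload prevented
    return True  # upload allowed

database = [([0.0, 0.0], "restricted"), ([1.0, 1.0], "ok")]
blocked = screen_upload([0.05, 0.0], database, threshold=0.1)
allowed = screen_upload([1.0, 0.9], database, threshold=0.1)
```

A refusal of this kind is the point at which a report identifying the point of origin or user identity could be generated.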

In other embodiments, the method includes: supplying candidate files on the list to a predictor arranged to refine the recommendation; inputting at least one of user data and media information relating to content into the predictor; and generating a revised list of candidate data files having regard to the list and the user data and/or media information.

In another embodiment, the method further comprises: computing a file vector as an embedding for the source data file; detecting a number of close neighbour candidate files for which determined distance measures relative to the file vector do not exceed a predefined threshold; assembling one or more textual descriptions for the candidate files reflective of respective property vectors therefor; generating a representative composite textual description from descriptions associated with candidate files within the threshold distance; and making the representative composite textual description available.
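The composite-description embodiment might be approximated as follows. The sketch assumes that each candidate file carries a list of descriptive terms and that the composite description is formed from the terms occurring most often among neighbours within the threshold distance; this frequency-based aggregation, and the name `composite_description`, are assumptions rather than the method set out in the specification.

```python
import math
from collections import Counter

def composite_description(file_vector, candidates, threshold, max_terms=3):
    """Collect the most frequent descriptive terms among close neighbours.

    candidates is a list of (property_vector, terms) pairs. Frequency counting
    is an assumed way of forming the representative composite description.
    """
    counts = Counter()
    for vector, terms in candidates:
        if math.dist(file_vector, vector) <= threshold:
            counts.update(terms)
    return [term for term, _ in counts.most_common(max_terms)]

# Hypothetical candidates with hand-written descriptive terms.
candidates = [
    ([0.1, 0.1], ["mellow", "acoustic"]),
    ([0.2, 0.1], ["mellow", "slow"]),
    ([0.9, 0.9], ["aggressive"]),
]
description = composite_description([0.1, 0.1], candidates, threshold=0.2)
```

Only the first two candidates fall within the threshold, so the term they share leads the resulting description and the distant candidate contributes nothing.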

The data files may contain content in the form of at least one of music, video, images data, speech, and text files.

In another aspect of the invention there is provided a method of generating a playlist, the method comprising: processing a source data file to extract properties therefrom; calculating a file vector in property space from said extracted properties, wherein the file vector both preserves and is representative of semantic properties of content of the source data file; comparing the file vector to a plurality of property vectors, wherein each property vector is associated with a candidate data file of a plurality of candidate data files in a database; determining a measured separation in continuous multi-dimensional property space between the file vector of the source data file relative to respective property vectors of at least some of the plurality of candidate data files; and plotting a progressive transition through the playlist by selecting candidate data files between the source data file and an end data file in which transitions between consecutive data files in the playlist ensure that transitions and semantic distances between adjacent data files are within a threshold distance and that the direction of travel through the playlist is semantically towards the end data file.

Transitions between adjacent files may be the shortest measured separation in continuous multi-dimensional property space.

Each transition between adjacent files may assess measured separation in a subset of candidate data files in the database.
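The progressive transition through a playlist can be illustrated with a simple greedy walk. At each step the sketch considers only candidates that are within the threshold distance of the current file and strictly closer to the end file, then takes the one with the shortest separation from the current file. The greedy strategy, the Euclidean measure and the name `build_playlist` are assumptions chosen for brevity, not the procedure defined by the specification.

```python
import math

def build_playlist(start, end, candidates, max_step):
    """Greedily step from start towards end through candidate property vectors.

    Each hop must be no longer than max_step and must reduce the distance to
    the end vector. Returns None if no admissible hop exists.
    """
    playlist = [start]
    current = start
    remaining = list(candidates)
    while math.dist(current, end) > max_step:
        options = [
            c for c in remaining
            if math.dist(current, c) <= max_step
            and math.dist(c, end) < math.dist(current, end)
        ]
        if not options:
            return None  # no path under these constraints
        nearest = min(options, key=lambda c: math.dist(current, c))
        remaining.remove(nearest)
        playlist.append(nearest)
        current = nearest
    playlist.append(end)
    return playlist

# Hypothetical two-dimensional property vectors.
candidates = [(1.0, 0.0), (2.0, 0.0), (1.0, 2.0)]
playlist = build_playlist((0.0, 0.0), (3.0, 0.0), candidates, max_step=1.2)
```

The candidate at (1.0, 2.0) is never selected because reaching it would exceed the step threshold, so the playlist moves steadily towards the end file.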

In the method of generating a playlist, the file vector and each property vector are each an output from a trained artificial neural network “ANN” that, following pairwise training of the ANN using pairs of training files, maps pairwise similarity/dissimilarity in property space towards corresponding pairwise semantic similarity/dissimilarity in semantic space to preserve semantic evaluation by valuing, on a pairwise basis, semantic perception reflected in quantified semantic dissimilarity distance measures over property assessment reflected by distance measures in property space.

In yet another aspect of the invention there is provided a method of providing a file recommendation based on semantic qualities, the method comprising: identifying a reference data file that has recently been consumed by a user; processing the reference data file to extract properties therefrom; calculating a first file vector in property space from said extracted properties, wherein the first file vector both preserves and is representative of semantic properties of content of the reference data file; evaluating a new data file in terms of semantic closeness to the reference data file, said evaluation based on a relative comparison between the first file vector and a different second file vector derived from properties of the new data file and where the second file vector also preserves and is representative of semantic properties of content of the new data file; determining availability and extent of at least one of (a) user data obtained for the user, and (b) property vectors in candidate file data, said property vectors reflective of semantic qualities therein; providing the file recommendation based on a probabilistic weighting between: a content-based approach of semantic closeness evaluated between the reference data file and the new data file; and a predictive approach based on one of a predictive model, a reinforcement learning “RL” algorithm or heuristic processing function, wherein the predictive approach is based on sufficiency in availability of user data and property vectors in candidate file data.

The probabilistic weighting between the content-based approach and the predictive approach may vary with time.

Initially, the content-based approach may be absolute.
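A minimal sketch of such a time-varying probabilistic weighting is given below. It assumes that the sufficiency of user data can be summarised by a count of recorded user events and that the probability of using the predictive approach grows linearly with that count until it saturates; the linear schedule, the `events_needed` parameter and the function names are all hypothetical.

```python
import random

def predictive_weight(num_user_events, events_needed=100):
    """Probability of using the predictive approach, from 0.0 up to 1.0.

    With no user data the weight is 0.0, so the content-based approach is
    absolute. The linear ramp is an assumed schedule.
    """
    return min(1.0, num_user_events / events_needed)

def choose_approach(num_user_events, rng, events_needed=100):
    """Pick 'predictive' or 'content' according to the current weighting."""
    weight = predictive_weight(num_user_events, events_needed)
    return "predictive" if rng.random() < weight else "content"

rng = random.Random(0)  # seeded only to make the example repeatable
cold_start_choice = choose_approach(0, rng)
mature_choice = choose_approach(500, rng)
```

With no recorded events the content-based approach is always chosen, which corresponds to the cold-start case, and the balance shifts towards the predictive approach as user data accumulates.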

In a further aspect of the invention there is provided a system of evaluating semantic closeness of a source data file relative to at least some of a plurality of candidate data files stored in a database, the system comprising processing intelligence arranged to: process the source file to extract properties therefrom; calculate a file vector in property space from said extracted properties, wherein the file vector both preserves and is representative of semantic properties of content of the source data file; compare the file vector to a plurality of property vectors, wherein each property vector of the plurality of property vectors is associated with a particular candidate data file of the plurality of stored candidate data files; determine a measured separation in continuous multi-dimensional property space between the file vector of the source file relative to respective property vectors of at least some of the plurality of candidate data files; generate a list based on said measured separation and semantic closeness of content of the source data file, wherein the list identifies, relative to the source data file, at least one semantically close candidate data file from the database; and provide the list as a recommendation.

The system intelligence may compare the source data file against all of the candidate data files in the database.

The file vector and each property vector are each an output from a trained artificial neural network “ANN” that, following pairwise training of the ANN using pairs of training files, maps pairwise similarity/dissimilarity in property space towards corresponding pairwise semantic similarity/dissimilarity in semantic space to preserve semantic evaluation by valuing, on a pairwise basis, semantic perception reflected in quantified semantic dissimilarity distance measures over property assessment reflected by distance measures in property space.

The database may include property vectors for candidate files cross-referenced to a descriptor or code identifying the content of each candidate file.

The system intelligence may be arranged to prevent upload of the source data file when the descriptor or code indicates content in the source file is inappropriate for circulation or publication.

The system intelligence may be further arranged to generate a report identifying a point of origin or user identity for the source file. This information may be obtained from user credentials, including registration and log-in details, or a MAC address.

The database may be remote from a user device that is arranged to upload the source data file.

The system may include a predictor arranged to refine the recommendation, wherein the predictor has: a first input responsive to candidate data files on the list; and at least a second input responsive to at least one of user data and media information relating to content; and wherein the predictor is arranged to generate a revised list of candidate data files having regard to the list and the user data and/or media information.

In a particular embodiment, the system intelligence is arranged to: compute a file vector as an embedding for the source data file; detect a number of close neighbour candidate files for which determined distance measures relative to the file vector do not exceed a predefined threshold; assemble one or more textual descriptions for the candidate files reflective of respective property vectors therefor; generate a representative composite textual description from descriptions associated with candidate files within the threshold distance; and make the representative composite textual description available.

The data files can contain content in the form of at least one of: music, video, images data, speech, and text files.

The system intelligence may be a server-side component remotely and selectively connected to a user device over a network. The system intelligence may be distributed, but it can also be realised in software or a combination of software and hardware.

In yet another aspect of the invention there is provided a processor for generating a playlist from candidate files stored in a database, the processor arranged to: process a source data file to extract properties from content thereof; calculate a file vector in property space from said extracted properties, wherein the file vector both preserves and is representative of semantic properties of the content of the source data file; compare the file vector to a plurality of property vectors, wherein each property vector is associated with a candidate data file of a plurality of candidate data files; determine a measured separation in continuous multi-dimensional property space between the file vector of the source data file relative to respective property vectors of at least some of the plurality of candidate data files; and plot a progressive transition through the playlist by selecting candidate data files between the source data file and an end data file in which transitions between consecutive data files in the playlist ensure that transitions and semantic distances between adjacent data files are within a threshold distance and that the direction of travel through the playlist is semantically towards the end data file.

Transitions between adjacent files may be assessed by the processor to be the shortest measured separation in continuous multi-dimensional property space.

Each transition between adjacent files may be assessed on the basis of measured separation to a subset of candidate data files in the database.

In still yet another aspect of the invention there is provided a system containing processing intelligence arranged to provide a file recommendation based on semantic qualities, the system intelligence arranged to: process a reference data file to extract properties therefrom; calculate a first file vector in property space from said extracted properties, wherein the first file vector both preserves and is representative of semantic properties of content of the reference data file; evaluate a new data file in terms of semantic closeness to the reference data file, said evaluation based on a relative comparison between the first file vector and a different second file vector derived from properties of the new data file and where the second file vector also preserves and is representative of semantic properties of content of the new data file; determine availability and extent of at least one of (a) user data obtained for the user, and (b) property vectors in candidate file data, said property vectors reflective of semantic qualities therein; provide the file recommendation based on a probabilistic weighting between: a content-based approach of semantic closeness evaluated between the reference data file and the new data file; and a predictive approach based on one of a predictive model, a reinforcement learning “RL” algorithm or heuristic processing function, wherein the predictive approach is based on sufficiency in availability of user data and property vectors in candidate file data.

The system intelligence may be arranged to vary with time the probabilistic weighting between the content-based approach and the predictive approach.

The system intelligence may initially make the content-based approach absolute.

The system intelligence may be a server-side component remotely and selectively connected to a user device over a network. Alternatively, the system intelligence is located, at least in part, in a user device.

According to the various embodiments and aspects described, the property vector derived from extractable measurable properties of a data file is mapped to semantic properties for that data file. The property vector is an output from a trained artificial neural network that, following pairwise training of the ANN using pairs of files that map pairwise similarity/dissimilarity in property space towards corresponding pairwise semantic similarity/dissimilarity in semantic space, both preserves and is representative of semantic properties of the data file. The system and method assesses, based on comparisons between generated property vectors, ranks and then recommends and/or filters semantically close or semantically disparate candidate files in a database from a query from a user that includes the data file. Applications of the categorization and recommendation system and method apply to media or search tools and social media platforms, including media in the form of music, video, images data and/or text files.

The processing intelligence functions to associate quantified semantic dissimilarity measures for said content in semantic space with related property separation distances in property space for measurable properties extracted for that content.

Fundamentally, the approach differs from current data science approaches that are rooted in hard and/or absolute data values. Rather, the system makes use of weighted output results from a neural network tasked with evaluating, in a vector space, dissimilarity of extracted measurable properties of pairwise-contrasted source files back towards human perception of similarity/dissimilarity as expressed in semantic space between the same pairwise-contrasted source files. This semantic space is a different vector space in which subjective descriptive context is mapped into measurable vectors representative of the context but now expressed in manipulable mathematical form. In other words, the embedding process is designed such that subjective descriptions which are semantically similar are viewed in the resulting vectoral (semantic) space as correspondingly similar. The resulting use of the embedding permits improved and more reliable recommendations and responses to user queries presented as an uncoded/raw data file to the system intelligence.

Advantageously, the present invention provides an innovative methodology for data categorization and, more particularly, a system and method that permits rapid assimilation of user-perceivable qualities between original data and possible relevant search data, e.g. detection of audio or sections of an audio file that are likely to warrant a listener's interest. The approach applies equally to image, video, text and speech data files or a combination of two or more of these file types.

A preferred embodiment, amongst other things, provides a track finder or track recommendation tool that is able to consistently characterize a file by distilling out identifiable properties in a section thereof, and then to identify other files that commonly share those characteristics and/or subjective qualities.

Given the number of accessible files, including variations that can subtly or significantly change the original file, within data libraries (whether personal ones containing hundreds or a few thousand files or commercial libraries having millions of files for commercial streaming, download or reference), the present invention provides a useful and effective recommendation tool that hones search results for files based on ranking of perceived similarities in qualities and is thus able to disregard arbitrary categorization and rather to focus on perceptive qualities/similarities.

The search and recommendation tools of the various embodiments therefore beneficially reduce the need for extensive review of files to identify new content (in new data files) that is consistent with the user's particular and subjective tastes, i.e. the search and recommendation tool reduces the search space by identifying user-orientated perceptually relevant data from candidate data files. Moreover, through objective and technically qualified assessment, the embodiments of the invention provide increased and more rapid access to a greater range of content that is stored or accessible through libraries, especially subscriber-accessible on-line libraries or server stores, thereby improving both end-user selection and end-user access to content through qualified recommendation. The embodiments of the invention can therefore mitigate the issues of cold start by promoting new files, artists, creators or avenues for research to a more selective and likely more receptive user base based on perceptually similar properties in file content.

The principles apply to the identification of other contextually describable subjective works that act as a source for computer-implemented data analysis, including music, images, text and/or video.

Various aspects and embodiments of the invention as outlined in the appended claims and the following description can be implemented as a hardware solution and/or as software, including downloadable code or a web-based app.

BRIEF DESCRIPTION OF THE DRAWINGS

The application file contains at least one drawing executed in color. Copies of this patent application publication with color drawings will be provided by the Office upon request and payment of the necessary fee.

Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings in which:

FIG. 1 represents a flow diagram of a preferred process to assess dissimilarity of files and, particularly, audio files, and a process by which an artificial neural network may be trained according to the present invention;

FIG. 2 is a schematic representation of a system architecture for training an artificial neural network according to a preferred embodiment;

FIG. 3 is a flow diagram relating to a preferred process of training the neural network of FIG. 2 to assimilate semantic vector space with property vector space to identify property similarities and property dissimilarities between source files;

FIG. 4 is a presentation of a typical mel-spectrum for an audio track;

FIG. 5 is illustrative of convolutional and pooling layers within an artificial neural network assigned to mel-spectrum interpretation;

FIG. 6 is a representation of an artificial neural network employed within the various ANN chains of FIG. 2;

FIG. 7 is a flow process employed by a preferred embodiment to assess a measure of emotionally-perceptive file dissimilarity, especially in the context of an audio file;

FIG. 8 is a network architecture, including an accessible database containing vector representation according to a preferred embodiment;

FIG. 9 illustrates two exemplary embeddings in the context of exemplary video file assessment;

FIG. 10 shows a functional architecture of a recommendation system implementing the preferred method of FIG. 7;

FIG. 11 shows a functional architecture of a hybrid recommendation system including multiple recommendation layers according to an embodiment of the present invention;

FIG. 12 shows a functional architecture of an alternative hybrid recommendation system including multiple recommendation layers according to an embodiment of the present invention;

FIGS. 13 and 14 show functional architectures of file-centric and user-centric recommendation systems according to embodiments of the present invention;

FIG. 15 shows a functional diagram of a tagging system and content filter according to an aspect of the present invention;

FIG. 16 shows a functional architecture of a source-to-target playlist generation system; and

FIG. 17 shows line distances between data files represented as circles and in which diagram only one such relationship is shown to five data points.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT

In order to provide a tool, such as accessed through a web-browser or local app, that evaluates semantic similarities or dissimilarities between (for example) audio tracks, it has been recognised that it is necessary to make use of deep-learning and artificial intelligence to identify similarities between semantic meaning, processed to provide a first metric in semantic space, and extracted measurable properties for content of the same data source in a different measurable space, such as Euclidean space (although other dimensional spaces may be used). This process effectively provides a translational mapping between the similarities in semantic meaning in one space and similarities in extracted measurable properties in another space.

More particularly, it has been recognized that a measure of emotionally-perceptive similarity or dissimilarity (especially in the exemplary sense of a digital audio file, image file or other perceptive aesthetic creation in digital form) cannot be derived from hard data fields alone, e.g. quantized representations of signal quality, since such hard data does not provide for any interpretation that is experienced by a human reviewer, e.g. a listener. In other words, feature extraction in isolation does not give a sufficiently accurate objective assessment of emotionally-perceived similarity or dissimilarity because quantised representations (whether in isolation or grouped) of signal qualities do not provide any relationship into the emotive real world.

The present invention therefore functions, initially, in the context of one or more trained artificial neural networks (ANNs) that [relative to deficient earlier entirely de-coupled and objectively assessed approaches] are functionally arranged to map, i.e. associate or couple, subjectively-derived content descriptions expressed in semantic space to measurable properties extracted for the same pair of contrasted files as expressed in Euclidean space, thereby correcting for the de-coupling that currently exists between feature extraction and human intuition and/or human emotive perception of similarity or dissimilarity in, particularly, subjectively-evaluated/perceived data, e.g. music.

The effect of the neural network functions is to create two independent vectors that both purport to represent emotionally-perceivable or documented dissimilarities in digital audio and/or image data and/or literary work, but in different vector spaces. The first vector in semantic space is based on the human descriptions of source files and thus carries significantly higher contextual weight. The first vector is therefore used to assess and correct the second vector in, for example, Euclidean space, thereby allowing convergence (through changing of weights in the ANN) of the output of a different neural network to the semantic result of the first neural network. The Euclidean vector is also derived from selected subjective properties extracted from the original source data, e.g. pairwise comparison of songs, during deep-learning in artificial neural networks.

Following training, the convergence process provides, ultimately, a transformative function in the ANN that permits any data file to be assessed relative to other pre-assessed data files to assess similarity in semantic and emotionally-perceivable content.

As such, at least during a training phase for an artificial neural network, two independent vectors are generated for a common source. The first vector is semantically based and derived from (typically) associated metadata for the source data/file and the second vector is extracted from the main content (e.g. payload) of the source/data file. Whilst these two vectors (the first based on human judgment and the second extracted from hard, identifiable and absolute measurable properties) should be identical, they may not be. Consequently, to produce a truly representative predictive tool that assesses emotional/perceptive dissimilarity or closeness, it is necessary that processing of the absolute measurable properties eventually leads to an identical result to processing of the human judgment, i.e. semantic, qualities. In order to reflect true emotive perception, the assessment relating to human judgment is of higher importance and trumps the absolute evaluation of identifiable and measurable tangible properties that are both obtained from the common source. Forcing a change in applied weights and bias values in an artificial neural network that processes the identifiable and measurable tangible properties obtains closer alignment with reality, as reflected by human intelligence, judgment and perceptive reasoning.

1. Similarity/Dissimilarity Assessment of Contextual Explanation in Semantic Space

An initial semantic description of the nature of the file, e.g. a contextual written description including context in a sentence and the use of particular words, is firstly converted or “embedded” into a multi-dimensional semantic vector using, for example, natural language processing “NLP” techniques and the like. The contextual written description amounts to a metric of human judgement which is subjective, perceptive and/or emotionally-based.

NLP, as supported by (for example) the Universal Sentence Encoder from Google® and particularly the Tensorflow™-hub, encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language processing tasks. In practical terms, NLP processing of two semantically similar descriptions will yield vector representations that are similar.

Whilst there may be some diversity in textual descriptions from different annotators, these differences are not considered statistically significant given the nature of the processing that is undertaken.

The choice of the conversion process between text and a vectorial representation is a design option, e.g. processing using Tensorflow™ may be based on training with a Transformer encoder or alternatively a Deep Averaging Network (DAN). The associated vector, in semantic space, is technically important from the perspective of overall training.

The semantic vectorization process is applicable to other forms of media data, such as image data in the form of a painting or film, that have semantic properties and corresponding aesthetic descriptors that can be converted into a numerical representation.

During the training sequence, an NLP-derived multi-dimensional vector is compared, on a pairwise basis, with other NLP-derived vectors to identify, in semantic vector space, a separation distance representative of pairwise semantic closeness. This firstly establishes a user-centric perception of pairwise closeness. In this sense, it will be appreciated that use of the terms “semantic” and “semantic space”, etc., reflects that the origin of any corresponding vector or value stems from a subjectively-prepared description of human perceptual or emotive (i.e. semantic) qualities of the content of a file, e.g. audio track.
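
By way of illustration only, the pairwise comparison described above can be sketched as follows. The sketch assumes that the NLP stage has already produced one vector per file and uses Euclidean distance as the separation measure; the three-dimensional toy embeddings, the file identifiers and the function names are hypothetical and are not taken from the described embodiment.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean separation distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_semantic_distances(embeddings):
    """Map each unordered pair of file identifiers to its distance in semantic space."""
    return {
        (a, b): euclidean(embeddings[a], embeddings[b])
        for a, b in combinations(sorted(embeddings), 2)
    }


# Hypothetical toy embeddings; a real sentence encoder emits far more dimensions.
toy_embeddings = {
    "track_a": [0.9, 0.1, 0.0],
    "track_b": [0.8, 0.2, 0.0],
    "track_c": [0.0, 0.1, 0.9],
}

distances = pairwise_semantic_distances(toy_embeddings)
for pair, distance in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, round(distance, 3))
```

In this toy data the pair (track_a, track_b) is separated by a smaller distance than either pair involving track_c, which is the sense in which semantically similar descriptions yield close vectors.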

The preferred use of NLP provides an initial mapping between textual descriptors and a vector value in semantic space. The same principle could be applied to categorization of other media, e.g. video, films, paintings, fashion in the exemplary sense of clothing and decoration (with properties being in terms of colours and patterns and texture for coverings and the like) as well as medical records that may include images.

To provide a context in terms of musicology, taking Rimsky-Korsakov's “Flight Of The Bumblebee” as a first audio training track, this audio track may be described in two word dimensions as “frenetic” and “light” with NLP ascribing a vector representation of 1004512112 for tracks containing only these two NLP-resolved terms. Of course, the number of linguistic dimensions can be more than two and so the audio track's description could be expanded to include other semantic associations arising, for example, with (i) temporal events, such as dusk, Easter, cloudy, etc., and/or (ii) feelings, and/or (iii) themes, e.g. fairy-tale or fact, and/or (iv) environments.

The vector “1004512112” is merely provided as an arbitrary example and, in fact, the generated multi-dimensional vector may take an entirely different form, especially since the number of word/sentence dimensions is only limited by the semantic associations that can be derived from the descriptive sentence for the audio track.

The process is repeated for a high number of independent samples, e.g. typically many thousands and preferably at least about ten thousand or more, to assemble a multi-dimensional matrix for the audio track-finding application which is used to provide a contextual example. Therefore, semantic similarity/dissimilarity is established between all training tracks, such as the aforementioned Flight Of The Bumblebee and, say, the electronic song “White Flag” by Delta Heavy or “Boulevard of Broken Dreams” as performed by Green Day. The size of the training set is, however, a design option driven by processing capacity, time and a desired level of achievable confidence/accuracy. Rather than assessing all pairs, an option is to select extreme variations in pairwise distance measures to train the ANN.
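
The option of training on extreme pairs only can be sketched, under stated assumptions, as a simple ranking step. The sketch assumes that pairwise semantic distances are already available as a mapping from pairs to distances; the function name, the parameter k and the example values are hypothetical.

```python
def select_extreme_pairs(distances, k):
    """Return the k closest and the k farthest pairs by semantic distance.

    `distances` maps a pair of file identifiers to a separation distance.
    k is capped so that the two returned selections never overlap.
    """
    ranked = sorted(distances, key=distances.get)
    k = min(k, len(ranked) // 2)
    closest = ranked[:k]
    farthest = ranked[len(ranked) - k:]
    return closest, farthest


# Hypothetical pairwise distances between four training tracks.
example_distances = {
    ("t1", "t2"): 0.10,
    ("t1", "t3"): 0.95,
    ("t1", "t4"): 0.40,
    ("t2", "t3"): 0.90,
    ("t2", "t4"): 0.35,
    ("t3", "t4"): 0.60,
}

closest, farthest = select_extreme_pairs(example_distances, k=2)
print("closest:", closest)
print("farthest:", farthest)
```

Selecting from both ends of the ranking is one possible reading of "extreme variations"; other selection rules, such as thresholding on distance, would serve the same purpose.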

A resultant semantic first vector will be assembled from, in a preferred embodiment, at least a multiple of 64 individual dimensional components (although the precise number is reflective of implementation choice and desired accuracy). When using the Tensorflow™ universal sentence encoder, the processing of the semantic description yields a vector (in semantic space) of five hundred and twelve dimensions. Consequently, the precise semantic vector length is a design option and may vary.

It does not matter whether the semantic vector and the property vector (described in more detail below) are of the same size since the system considers dissimilarity as assessed on a pairwise basis.

2. Distance Assessment Based on Extracted Properties

In generating the second independent vector in a second training process based on derived “properties” (as contrasted with semantic descriptions of the file used for pairwise semantic closeness outlined immediately above and described in detail in section 3 below), the weighting factors applied to nodes in layers of the neural network are changed by backpropagation to converge the results in (typically Euclidean) property distance space towards those of the semantic (typically Euclidean) separation distances (in semantic space) and therefore intrinsically back to the original semantic description(s).
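
The convergence of property distances towards semantic distances can be illustrated with a deliberately small sketch. It is an assumption-laden stand-in rather than the described network: a single linear layer W replaces the ANN, the loss for a pair is the squared difference between the output-space distance and the target semantic distance, and the gradient is written out by hand instead of using a deep-learning framework. All names, inputs, targets and hyperparameters are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def pair_loss_and_grad(W, x1, x2, target):
    """Squared error between the property-space distance and the semantic target.

    Because the layer is linear, W @ x1 - W @ x2 equals W @ (x1 - x2).
    """
    delta = [a - b for a, b in zip(x1, x2)]
    z = matvec(W, delta)
    d = math.sqrt(sum(v * v for v in z)) or 1e-12  # guard against division by zero
    err = d - target
    # d(err^2)/dW[i][j] = 2 * err * z[i] * delta[j] / d
    grad = [[2.0 * err * zi * dj / d for dj in delta] for zi in z]
    return err * err, grad


def total_loss(W, pairs):
    return sum(pair_loss_and_grad(W, x1, x2, t)[0] for x1, x2, t in pairs)


def train(W, pairs, lr=0.05, epochs=200):
    """Plain stochastic gradient descent over the training pairs."""
    for _ in range(epochs):
        for x1, x2, target in pairs:
            _, grad = pair_loss_and_grad(W, x1, x2, target)
            W = [[w - lr * g for w, g in zip(w_row, g_row)]
                 for w_row, g_row in zip(W, grad)]
    return W


# Hypothetical measured-property inputs with target semantic distances.
pairs = [
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0.2),  # semantically close pair
    ([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], 1.5),  # semantically distant pair
]
W_initial = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]
W_trained = train(W_initial, pairs)

print("loss before:", total_loss(W_initial, pairs))
print("loss after:", total_loss(W_trained, pairs))
```

The point of the sketch is only the direction of the update: the weights are adjusted so that distances between outputs move towards the semantic separation distances, with the semantic side held fixed as the reference.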

As indicated earlier, the vector space for the first and second vectors is different in the sense that, although from a common source and one file, the input qualities of the input data that is to be processed are different. Processing of subjective description material by NLP can therefore be considered to yield the first vector in semantic space (or semantic distance space), whereas processing of absolute values relating to identified properties (even if these properties can be expressed in different selectable numeric terms for signal properties) yields, as an output of the ANN, a second vector in “property space”.

In a preferred embodiment, Euclidean space is used as opposed to readily appreciated alternatives, i.e. non-Euclidean geometries.

An artificial neural network functions to convert measurable properties of a source file into a manipulable vectorial representation thereof. This conversion produces a second independently-generated vector, i.e. the second vector. This conversion can be considered as “feature extraction”. In a preferred embodiment (in the exemplary case of audio processing), feature extraction is achieved using the Essentia™ app developed by the Music Technology Group at Pompeu Fabra University (see https://essentia.upf.edu/documentation/streaming_extractor_music.html).

Essentia™ (or its functional equivalent) is an existing library providing a foundation for the analysis of a source audio file to identify a multiplicity of audio descriptors, such as band energies, band histograms and other measurable music qualities of the source track. In Essentia™, these audio descriptors number up to one hundred and twenty-seven. The audio descriptors can each be considered to be a quantized representation of a measurable parameter of the audio signal.

Returning to the exemplary context of an audio file, the processing intelligence behind Essentia™, in a like manner to equivalent categorization mechanisms, provides for feature extraction from the source file. Selection of appropriate ones of the audio descriptors in a subset defines a broader musical aspect or quality of each audio track, e.g. a first subset of measured quantized representations [nominally] from audio descriptor bins 1, 15, 32, 33 and 108 (from the possible total universal set of 127 audio descriptors in Essentia) might be combined by the programmer to define “rhythm”, whereas a subset of measured quantized representations from audio descriptors 5-21, 43, 45, 50, 71-77 and 123-127 could define “timbre” and a third, different subset could define tonality, i.e. tonal quality of the performance. The subsets therefore provide further semantic properties in the musicology of the sampled source audio track.
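
The grouping of descriptor bins into properties can be sketched as a simple index selection. The bin numbers below are the nominal ones quoted above and are treated as 1-based positions in a 127-element descriptor vector; that indexing convention, the placeholder descriptor values and the function name are assumptions made purely for illustration.

```python
# Nominal, 1-based descriptor bins quoted in the text for two properties.
RHYTHM_BINS = [1, 15, 32, 33, 108]
TIMBRE_BINS = [*range(5, 22), 43, 45, 50, *range(71, 78), *range(123, 128)]


def select_subset(descriptors, bins):
    """Pick the descriptor values at the given 1-based bin numbers."""
    return [descriptors[b - 1] for b in bins]


# Placeholder descriptor vector: bin n simply holds the value n.
descriptors = [float(n) for n in range(1, 128)]

rhythm_vector = select_subset(descriptors, RHYTHM_BINS)
timbre_vector = select_subset(descriptors, TIMBRE_BINS)

print(len(rhythm_vector), len(timbre_vector))
```

With these nominal bins, bin 15 contributes to both subsets, which illustrates that a single measured descriptor may inform more than one property.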

For other forms of source file, such as video or image files, alternative measurable parameters are parsed from the source file to define alternative usable qualities.

As indicated, in the context of audio and particularly audio properties, a piece of music can be described using timbre, rhythm, tonality and texture. The properties of timbre, rhythm and tonality are particularly important.

3. Measurable Musical Properties

In this respect, it will be appreciated that:

“TEXTURE” is generally reflected by two-dimensional patterns in the time-frequency space which relate to the temporal evolution of the spectral content. Texture is therefore seen in a mel-spectrogram or mel-spectrum that plots the frequency domain against the time domain. Within such a mel-spectrum, evolving texture can be learnt by a neural network (as described subsequently) by identifying patterns that evolve with time, such as for example (i) interrupted horizontal spectral lines in high/mid-range frequencies, (ii) parallel vertical spectral lines stretching the mid and high-frequency range, and (iii) ascending or descending steps in the low-mid frequency range. Texture therefore provides a further complementary semantic property that is useable, in the context of the present invention, to assess track similarity/dissimilarity through provision of a further measurable metric in property space.

“RHYTHM” can be considered as the arrangement of notes according to their relative duration and relative accentuation (see https://www.naxos.com/education/glossary.asp?char=P-R#). As will be appreciated, rhythm can be expressed in terms such as (but not limited to):

-   i) beats loudness as computed from beats and musical spectrogram with aggregations reflecting mean and variance (see https://essentia.upf.edu/documentation/reference/std_BeatsLoudness.html and https://essentia.upf.edu/documentation/reference/std_BeatTrackerMultiFeature.html);
-   ii) beats per minute “BPM” (see https://essentia.upf.edu/documentation/reference/std_BpmHistogramDescriptors.html and https://essentia.upf.edu/documentation/reference/std_BeatTrackerMultiFeature.html);
-   iii) BPM histogram as computed from the signal with aggregations reflecting first and second peak heights and spread (see https://essentia.upf.edu/documentation/reference/std_BpmHistogramDescriptors.html and https://essentia.upf.edu/documentation/reference/std_BeatTrackerMultiFeature.html);
-   iv) danceability (see https://essentia.upf.edu/documentation/reference/std_Danceability.html);
-   v) onset rate (see https://essentia.upf.edu/documentation/reference/std_OnsetRate.html); and
-   vi) band-wise beats loudness as computed from beats and musical spectrogram as reflected by mean values and variance over six bands (see https://essentia.upf.edu/documentation/reference/std_BeatsLoudness.html and https://essentia.upf.edu/documentation/reference/std_BeatTrackerMultiFeature.html).

Whilst the property of rhythm is, in Essentia terms, suggested as a collection of six measurable attributes, it will be appreciated that, in fact, more than six measurable attributes can contribute to this property, as reflected (for example) by the references to mean and variance values of specific musicological attributes. It will be understood by the skilled addressee that the multi-dimensional vector that is compiled for the property rhythm may therefore vary from the suggested Essentia parameters and be formed from other measurable attributes that provide a musicologically workable definition of rhythm. In a preferred embodiment, nominally nineteen (19) measurable attributes are assigned to the concept of rhythm, although other numbers of attributes can be used.

“TONALITY” is the arrangement of pitches and/or chords of a musical work in a hierarchy of perceived relations, stabilities, attractions and directionality. In this hierarchy, the single pitch or triadic chord with the greatest stability is called the tonic. Tonality is therefore an organized system of tones (e.g., the tones of a major or minor scale) in which one tone (the tonic) becomes the central point for the remaining tones and where the remaining tones can be defined in terms of their relationship to the tonic. Harmony is a perceptual tonal quality.

As will be appreciated, tonality can be expressed in terms such as (but not limited to):

-   i) chords change rates as computed from Harmonic Pitch Class Profiles (HPCP) of the spectrum (see https://essentia.upf.edu/documentation/reference/std_ChordsDescriptors.html);
-   ii) chords number rate as computed from HPCP (see https://essentia.upf.edu/documentation/reference/std_ChordsDescriptors.html);
-   iii) chords strength as computed from HPCP with aggregations reflecting mean and variance (see https://essentia.upf.edu/documentation/reference/std_ChordsDescriptors.html);
-   iv) HPCP entropy as computed from HPCP with aggregations reflecting mean and variance (see https://essentia.upf.edu/documentation/reference/std_HPCP.html and https://essentia.upf.edu/documentation/reference/std_Entropy.html);
-   v) key strength as computed from HPCP (see https://essentia.upf.edu/documentation/reference/std_KeyExtractor.html);
-   vi) tuning diatonic strength as computed from HPCP (see https://essentia.upf.edu/documentation/reference/std_TuningFrequency.html);
-   vii) tuning equal tempered deviation as computed from HPCP (see https://essentia.upf.edu/documentation/reference/std_TuningFrequency.html);
-   viii) tuning non-tempered energy ratio as computed from HPCP (see https://essentia.upf.edu/documentation/reference/std_TuningFrequency.html); and
-   ix) chords histogram as computed from HPCP (see https://essentia.upf.edu/documentation/reference/std_ChordsDescriptors.html).

Whilst the property of tonality is, in Essentia's terms, suggested as a collection of ten measurable attributes, it will be appreciated that, in fact, more than ten measurable attributes can contribute to this property, as reflected by the references to mean and variance values of specific musicological attributes. It will be understood by the skilled addressee that the multi-dimensional vector that is compiled for the property tonality may therefore vary from the suggested Essentia parameters and be formed from other measurable attributes that provide a musicologically workable definition of tonality. In a preferred embodiment, nominally thirty-three (33) measurable attributes are assigned to the concept of tonality, although other numbers of attributes can be used, with these obtained from an application of greater or lesser granularity of quantized measurement. For example, the “chords histogram” is implemented as a twenty-three-dimensional vector.

In terms of Essentia's treatment of another measurable attribute “chords strength”, this is computed through parsing the audio file with a moving window (frame) and, from each window (frame), extracting a value to yield a sequence of numbers (on a one number per frame basis). The sequence is, in turn, used to compute its mean and variance. Therefore, in a preferred embodiment, the measurement “chords strength” is rationalized to just two numbers, i.e., the mean and variance of the aforementioned sequence. This example shows how measurement values that are used in assessment of an identified property can depart from the recommendations made in Essentia, albeit that the multi-dimensional vector that is produced to reflect the property, e.g. rhythm or tonality, contains sufficient spectral information to provide a user-acceptable definition of the property.
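
The reduction of a per-frame sequence to two numbers can be sketched as follows. The sketch does not call Essentia; the per-frame measure used here (mean absolute amplitude) is a hypothetical stand-in for a chords-strength value, and the frame size, hop size, function names and use of population variance are likewise assumptions.

```python
from statistics import fmean, pvariance


def frames(signal, size, hop):
    """Yield successive windows of `size` samples, advancing by `hop` samples."""
    for start in range(0, len(signal) - size + 1, hop):
        yield signal[start:start + size]


def summarize(signal, size, hop, per_frame):
    """Reduce one value per frame to the (mean, variance) of the sequence."""
    values = [per_frame(frame) for frame in frames(signal, size, hop)]
    return fmean(values), pvariance(values)


def mean_abs(frame):
    """Hypothetical per-frame measure standing in for a chords-strength value."""
    return fmean([abs(sample) for sample in frame])


# Toy signal: a quiet frame followed by a louder frame.
signal = [1.0] * 4 + [-3.0] * 4
mean_value, variance_value = summarize(signal, size=4, hop=4, per_frame=mean_abs)
print(mean_value, variance_value)
```

Whatever the per-frame measure, the property vector receives only the two aggregate values, which keeps its length independent of the duration of the audio file.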

“TIMBRE” is a relatively esoteric measure and manifests itself in the complexity of the sound, which can in turn be measured via the spectrogram of the sound. Timbre is the perceived sound quality of a musical note, sound or tone. Timbre distinguishes different types of sound production, such as choir voices and musical instruments (e.g. string instruments, wind instruments and percussion instruments). It also enables listeners to distinguish different instruments in the same category (e.g. an oboe and a clarinet, both woodwind instruments). Physical characteristics of sound that represent the perception of timbre include the sound spectrum and the signal envelope, with timbre permitting an ability to resolve sounds even in instances when the sounds have the same pitch and loudness.

As will be appreciated, timbre can be expressed in terms such as (but not limited to):

-   i) barkbands_crest as computed from the barkband-filtered spectrogram with aggregations over mean and variance for identified Bark frequency ranges (see https://essentia.upf.edu/documentation/reference/streaming_Crest.html and https://en.wikipedia.org/wiki/Bark_scale#Bark_scale_critical_bands);
-   ii) barkbands_flatness_db as computed from the barkband-filtered spectrogram with aggregations over mean and variance for identified Bark frequency ranges (see https://essentia.upf.edu/documentation/reference/std_FlatnessDB.html);
-   iii) barkband_kurtosis as computed from the barkband-filtered spectrogram with aggregations over the mean for identified Bark frequency ranges (see https://essentia.upf.edu/documentation/reference/std_DistributionShape.html);
-   iv) barkband_skewness as computed from the barkband-filtered spectrogram with aggregations over mean and variance for identified Bark frequency ranges (see https://essentia.upf.edu/documentation/reference/std_DistributionShape.html);
-   v) barkband_spread as computed from the barkband-filtered spectrogram with aggregations over the mean for identified Bark frequency ranges (see https://essentia.upf.edu/documentation/reference/std_DistributionShape.html);
-   vi) spectral dissonance as computed from the audio signal's spectrogram with aggregations reflecting mean and variance (see https://essentia.upf.edu/documentation/reference/std_Dissonance.html);
-   vii) dynamic_complexity as computed from the audio signal's RMS envelope (see https://essentia.upf.edu/documentation/reference/std_DynamicComplexity.html);
-   viii) high frequency content as computed from the audio signal's spectrogram with aggregation over the mean (see https://essentia.upf.edu/documentation/reference/std_HFC.html);
-   ix) pitch salience as computed from the audio signal's spectrogram with aggregations reflecting mean and variance (see https://essentia.upf.edu/documentation/reference/std_PitchSalience.html);
-   x) spectral complexity as computed from the audio signal's spectrogram with aggregations reflecting mean and variance (see https://essentia.upf.edu/documentation/reference/std_SpectralComplexity.html);
-   xi) spectral energy high frequencies as computed from the audio signal's spectrogram with aggregations reflecting mean and variance (see https://essentia.upf.edu/documentation/reference/std_EnergyBand.html);
-   xii) spectral energy low frequencies as computed from the audio signal's spectrogram with aggregations reflecting mean and variance (see https://essentia.upf.edu/documentation/reference/std_EnergyBand.html);
-   xiii) spectral energy mid-high frequencies as computed from the audio signal's spectrogram with aggregations reflecting mean and variance (see https://essentia.upf.edu/documentation/reference/std_EnergyBand.html);
-   xiv) spectral energy mid-low frequencies as computed from the audio signal's spectrogram with aggregations reflecting mean and variance (see https://essentia.upf.edu/documentation/reference/std_EnergyBand.html);
-   xv) spectral entropy as computed from the audio signal's spectrogram with aggregations reflecting mean and variance (see https://essentia.upf.edu/documentation/reference/std_Entropy.html);
-   xvi) spectral flux as computed from the audio signal's spectrogram with aggregations reflecting mean and variance (see https://essentia.upf.edu/documentation/reference/streaming_Flux.html);
-   xvii) spectral kurtosis as computed from the audio signal's spectrogram with aggregation over the mean value (see https://essentia.upf.edu/documentation/reference/std_DistributionShape.html);
-   xviii) spectral strong peak as computed from the audio signal's spectrogram with aggregations reflecting mean and variance (see https://essentia.upf.edu/documentation/reference/std_StrongPeak.html);
-   xix) zero crossing rate as computed from the audio signal and with aggregations over mean and variance (see https://essentia.upf.edu/documentation/reference/std_ZeroCrossingRate.html);
-   xx) MFCCs as computed from the audio signal's spectrogram with aggregation over the mean (see https://essentia.upf.edu/documentation/reference/std_MFCC.html); and
-   xxi) spectral contrast as computed from the audio signal and with aggregations over mean and variance of both peaks and valleys (see https://essentia.upf.edu/documentation/reference/std_SpectralContrast.html).

Whilst the property of timbre is, in Essentia's terms, suggested as a collection of twenty-one (21) measurable attributes, it will be appreciated that, in fact, more than twenty-one measurable attributes can contribute to this property, as reflected by the references to mean and variance values of specific musicological attributes. It will be understood by the skilled addressee that the multi-dimensional vector that is compiled for the property timbre may therefore vary from the suggested Essentia parameters and be formed from other measurable attributes that provide a musicologically workable definition of timbre. In a preferred embodiment, nominally seventy-five (75) measurable attributes are assigned to the concept of timbre, although other numbers of attributes can be used, with these obtained from an application of greater granularity in measurement, as indicated above and as will be understood by a musicologist.

In the context of audio track assessment and track-finding, the properties of tonality, rhythm and timbre importantly provide a basis by which measurement of subjective qualities of a source file can be assessed objectively. These properties may be derived from Essentia™ attributes, as identified above, or a subset of those Essentia™ signal attributes or from an equivalent library identifying suitable audio descriptors. Indeed, as will be appreciated, the present invention selects nineteen, thirty-three and seventy-five quantised representations for the properties of rhythm, tonality and timbre respectively, with some of these overlapping with the Essentia™ tool-box whereas others are variants or different signal measures. Consequently, the number of quantized representations is not fixed, but rather variable according to the musicologist's belief concerning what signal attributes are required to define the particular properties that are being assessed.

Given the above, it is a design option as to how a skilled person selects measurable attributes, or indeed which measurable attributes are selected, to define a suitable property for use in an assimilation process. The property of rhythm, for example, may be reviewed to include or exclude certain of the Essentia measurements, so in some respects it is understood that whilst the assessed properties are technical in nature and are measurable by existing technical processes, the lack of a consistent definition of what amounts to a “property” is unsurprising but not technically relevant. Rather, properties of the content of the file are to a degree both esoteric and subjective. However, it is the mapping of definitive yet subjectively assembled measurables in property space into an independent yet entirely relevant and corresponding semantic assessment in semantic space which is important.

4. Artificial Neural Network (ANN)

In accordance with concepts of the various aspects and embodiments of the present invention, pairwise similarity/dissimilarity in property space is mapped back to initial semantic similarity/dissimilarity (e.g. expressive and subjective linguistic descriptors) in semantic space. This is a multi-stage process that may involve multiple neural networks running in parallel. The use of multiple parallel ANNs permits control of musical modality, although use of a single ANN is possible. Aspects of the invention are concerned with training of the neural network that processes the extracted properties and evaluates dissimilarity in the property space.

FIG. 1 represents a flow diagram of a preferred process 100 to assess dissimilarity of files (and particularly audio files) and a process by which an artificial neural network may be trained according to the present invention. FIG. 1 therefore corresponds to and expands upon the process described above in relation to section “1: Similarity/Dissimilarity Assessment of Contextual Explanation in Semantic Space”.

Audio files are used as an example of the underlying process since audio files, especially music files, can be subjectively interpreted from applied individual human perception.

From a training set of many hundreds (and preferably many thousands) of source files, pairs of files are selected 102 and semantically contrasted through ANN assessment. In a first path, using NLP, an artificial neural network extracts 104, i.e. processes to generate/embed, a representative vector for the semantic meaning conveyed in associated textual metadata (or an accompanying description) for each file, e.g. each audio track of the pair. This results in, typically, the production 106 of a five hundred and twelve (512) dimensional vector from Tensorflow™ (or the like) that expresses the derived semantic meaning as a manipulatable value that can be evaluated.

The ANN can therefore effectively tabulate vectorial separation distances between all N files in the training set, where N is typically more than five hundred files and generally considerably more than several thousand. The more samples in the training sequence, the greater the granularity and associated confidence, albeit that higher numbers of samples increase processing complexity. In short, the more samples the better. However, as an option to train the ANN, the process may make a sub-selection of pairs where distance separations indicate that they are either very similar or very dissimilar, i.e. training may be based on extreme conditions.
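By way of non-limiting illustration only, the tabulation of separation distances between all N files may be sketched as follows. The use of a Euclidean metric for the semantic vectors, the function names and the three-dimensional stand-in vectors are assumptions made for the purpose of the sketch and are not taken from the described embodiment, in which the semantic vectors are typically 512-dimensional.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_table(vectors):
    """Return an N x N table of separation distances between all vectors."""
    n = len(vectors)
    table = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = euclidean(vectors[i], vectors[j])
            table[i][j] = table[j][i] = d  # separation distance is symmetric
    return table

# Hypothetical 3-dimensional stand-ins for the semantic vectors of three files.
semantic_vectors = [
    [0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 1.0],
]
table = distance_table(semantic_vectors)
```

The table grows with the square of N, which is consistent with the observation above that higher numbers of samples increase processing complexity.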

At this point, tabulation of relative distance separation is abstract in that, whilst absolute distances exist in terms of separation distance values (e.g. five measurement units, fifty-seven measurement units or 1013 units), they do not reflect a scaled value of similarity/semantic closeness in multi-dimensional space. Assuming that N is sufficiently large, it has been recognised that for each file (track) there exists at least a reasonably certain number m of those N files (where m is a positive integer and m<<N) that will be similar or dissimilar.

In a preferred embodiment, for each source file in the training set, e.g. song “A”, an arbitrary number, say ten, closest vectors in semantic distance vector space are selected; this forms a group or cluster of closely semantically-related songs. Statistically, in a training set of several thousand or perhaps a few tens of thousands of source files, clustering together [as equivalent] 0.1% of the universe is statistically acceptable in terms of likely semantic closeness. Indeed, relative to the universe of songs in a reasonable training sequence, closeness may be viewed to be in the range of between about 0.05% and about 1%, although with increasing percentage values the likely user-perception of audio dissimilarity will increase.

For a song “A”, the system intelligence is arranged to consider the “m” (e.g. the ten and where m≥1) nearest songs as semantically similar in the sense of being user-perceptually close. This is reflected by setting, and then recording in a data record, a distance between these m songs around the vector for song “A” to be zero. For all songs outside the m closest, the system intelligence is arranged to consider these songs as dissimilar, i.e. these other (not m) songs are treated as semantically dissimilar in the sense of being user-perceptually far apart. Consequently, dissimilar songs are identified, relative to song “A”, as having a distance of one. Therefore, for each assessed audio track, 2*m pairs of records are created and stored by the system as a retrievable and accessible record. Selection of an equal value of m for the similar and dissimilar groups ensures that training of the neural network is not biased by one extreme or the other (in terms of similarity or dissimilarity).

The processing burden on the ANN can, in most cases, be rationalised 114 at some point in the training process, as will be understood. Specifically, optimized training of an ANN is achieved through training with extreme cases, rather than with a bulk of similar values. Consequently, for any pairwise association, taking farthest apart and closest separation distances reduces the time to hone the weights applied to neurons in the ANN.
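By way of non-limiting illustration only, the creation of 2*m quantized pair records per track may be sketched as follows. The sketch assumes that the m dissimilar partners are taken as the m farthest tracks (one reading of training with extreme cases); the function name, the record layout (track, partner, target distance) and the distance values are hypothetical.

```python
def build_pair_records(table, m):
    """For each track, label its m nearest tracks 0 (similar) and its m
    farthest tracks 1 (dissimilar), giving 2*m records per track."""
    n = len(table)
    if 2 * m > n - 1:
        raise ValueError("need at least 2*m other tracks per track")
    records = []
    for a in range(n):
        # Every other track, ordered from closest to farthest from track a.
        others = sorted((j for j in range(n) if j != a), key=lambda j: table[a][j])
        for j in others[:m]:
            records.append((a, j, 0))  # semantically close: target distance zero
        for j in others[-m:]:
            records.append((a, j, 1))  # semantically far: target distance one
    return records

# Hypothetical semantic distances between five tracks (symmetric, zero diagonal).
table = [
    [0, 1, 2, 8, 9],
    [1, 0, 1, 7, 8],
    [2, 1, 0, 6, 7],
    [8, 7, 6, 0, 1],
    [9, 8, 7, 1, 0],
]
records = build_pair_records(table, m=1)
```

Using the same m for both labels keeps the numbers of similar and dissimilar records equal for every track.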

A first semantic reference in the form of a “first vector”, as outlined above in section 1: Similarity/Dissimilarity Assessment of Contextual Explanation in Semantic Space, is thereby established as a reference for ANN training.

Returning to the original source files (e.g. audio tracks), a second path 126 for evaluation and assessment again looks, on a pairwise basis, for indicative patterns across the entire training space of N files (e.g. N audio tracks). Particularly, as indicated above in section 2: Distance Assessment based on Extracted Properties, the process undertakes feature extraction 130 of signal attributes by parsing the source (audio) file pairs to produce bins of quantized representations of signal qualities, such as explained above in section 3: Semantic Properties [in the specific exemplary context of audio/music]. Individual bins of quantized representations of signal qualities are then appropriately identified and selectively grouped together 132 to define semantic/subjective musicological properties, i.e. rhythm, tonality, timbre and texture, that can be evaluated and manipulated in more absolute terms in property space.

Reference is made to FIG. 2 and the process of FIG. 3.

FIG. 2 is a schematic representation of a system architecture for training a system including artificial neural networks according to a preferred embodiment. FIG. 3 is a flow diagram relating to a preferred process of training the neural network of FIG. 2 to assimilate semantic vector space with property vector space to identify property similarities and property dissimilarities between source files.

On a pairwise basis, two files (e.g. digital audio files 302, 304) of the N files are selected from a training database 306 of files and are subjected to assessment and interpretation by the system 300. The system 300 may be embodied within a more general system intelligence, such as supported by a server or a distributed system of interactive processors, and includes a plurality of artificial neural networks.

As indicated above, initial processing of each selected audio file in a feature extractor 301 (such as Essentia or its functional equivalent, whether this be in the context of the exemplary case of audio file processing or for a different format of source file, such as a picture) produces bins of quantized representations of signal qualities, with these bins selectably grouped to define a plurality of respective outputs representing different semantic properties P, e.g. timbre “PTi”, tonality “PTo” and rhythm “PR”, in numeric terms. Value representations for each of these subjective properties for each audio track (e.g. PTo₂ for the property of tonality extracted from track 2) are applied commonly as inputs to dedicated parallel neural networks for weight optimization in the evaluation process for each property.

In the exemplary context of an audio file and track finding system, there are independent ANNs for rhythm “NN_(R)” 310, tonality NN_(TO) 312, timbre NN_(TI) 314 and musical texture NN_(TX) 318.

Musical texture is a special case and requires a different process flow. Musical texture is discussed below in more detail.

For processing and evaluation of other training data, such as images, there may be more or fewer parallel ANN chains. The ANN chains, shown to number four in FIG. 2, can be considered as independent processing paths, branches or pathways (and thus sub-networks of the network). The number relates only to the number of semantically discernible properties. The system may, in fact, operate with just a single chain that processes data in multiple passes to arrive at a composite result suitable for evaluation.

The ANN for rhythm “NN_(R)” 310 thus receives an input representation only of the property rhythm, with this being assembled (in a preferred embodiment) from a vector of nineteen components, i.e. nineteen extracted signal attributes. The ANN for tonality “NN_(TO)” 312 thus receives an input representation only of the property tonality, with this being assembled (in a preferred embodiment) from a vector of thirty-three components, i.e. thirty-three extracted signal attributes. The ANN for timbre “NN_(TI)” 314 thus receives an input representation only of the property timbre, with this being assembled (in a preferred embodiment) from a vector of seventy-five components, i.e. seventy-five extracted signal attributes.

As indicated above, the definition of each property can vary in terms of the number and/or attribute nature of the extracted signal representation for each bin. Therefore, in the express context of audio files and the use of Essentia, all of the available attribute signal bins (including, for example, barkbands_flatness_db and dynamic_complexity for timbre) may be used, some may be used, or others not mentioned above may be used in place of these or to otherwise extend the number. The definition of a “property” is therefore subjective (to some extent), although this subjectivity is irrelevant if a consistent approach to a property's definition is adopted. In other words, the programmer is able to determine how to define a subjective property by identifying and selecting desired measurements for signal attributes.

The ANNs for rhythm “NN_(R)” 310, tonality NN_(TO) 312, timbre NN_(TI) 314 and musical texture NN_(TX) 318 therefore determine and refine weight values that account for differences in these properties, with weights and biases refined by an iterative process involving the entirety of the training set and a backpropagation algorithm tasked to find the appropriate adjustments for each trainable parameter. The process of backpropagation is understood by the skilled addressee, so it is relevant to point to the intent of what is to be aligned and the objectives and benefits achieved by the architecture and process as described herein.

It has been recognized that the issue of musical texture also has a part to play in the assimilation of content property metrics (derived from vectorial representations of measurable properties of each track in pairwise comparison) to semantic metrics (derived from vectorial representations of semantic descriptions of each track in pairwise comparison).

The approach adopted by the embodiments of the present invention therefore emphasises the importance of human emotional perception over strict machine-learning, thereby weighting operation of an ANN towards human-perception rather than statistical mapping based on interpretation of absolute numeric data.

Turning briefly to FIG. 4, a typical mel-spectrum 500 is shown for an audio track. As will be understood, a mel-spectrograph (interchangeably known as or referred to as a mel-spectrum) has a quasi-logarithmic spacing roughly resembling the resolution of the human auditory system and is thus a more “biologically inspired” perceptual measure of music. The mel-spectrum is a representation of the short-term power spectrum of a sound across a frequency spectrum, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. In the mel-spectrum, consideration of a power spectrum in a frequency bin between (nominally) 50 Hz and 100 Hz would equate to consideration of a power spectrum across a larger frequency range at higher frequency, e.g. 400 Hz to 800 Hz but also 10 kHz to 20 kHz, because these frequency bins are perceptually of equal importance in musical interpretational terms. The process of how a mel-spectrum is generated is well-known.

Moreover, whilst noting that audio tracks can have musical themes that change on a section-by-section basis and which could thus affect the mel-spectrum, for the sake of explanation of a preferred embodiment it is assumed that the theme in the audio (and therefore the excerpted window) is relatively constant. Of course, the alternative is to partition an audio track, such as Queen's “Bohemian Rhapsody”, into sections that are each subject to a discrete evaluation process in semantic space.

Not only is the mel-spectrum just a partial sample, but it is also complex in nature in that it has dimensions in both the time domain and the frequency domain. Within the resulting 2-dimensional matrix of time domain and frequency domain components, a theme can be identified by isolation of patterns of interest. Such patterns of interest can be observed within the spectral components of a plot of frequency (as ordinate) against time (as abscissa): i) parallel vertical lines 502 stretching across the mid and high frequency range; ii) interrupted horizontal lines 504 in the high-mid frequency range; iii) ascending 506 or descending 508 steps in the low-mid frequency range. Other patterns, as will be understood, also exist within the mel-spectrum, with these being discoverable.

The property texture can therefore be derived from analysis of the mel-spectrum and, particularly, identification of patterns and trends by an ANN that provides additional vectorial components in property space that are used in the training of the system 300 of FIG. 2.

An output from each ANN, including a contribution for texture, for each track used in the training sequence/training data set is then assembled as an output, in property space, into a multi-dimensional output vector concatenated or otherwise assembled from multiple outputs OR_(x), OTO_(x), OTI_(x) and OTX_(x) (where x represents the related track number, i.e. track 1 or track 2) for each property for each track. The precise length of each output vector is open to a degree of design freedom, noting that its length is selected to be sufficient to allow for objective evaluation and differentiation in property space. In a preferred embodiment, each essentially parallel-processed output from each ANN chain contributes a sixty-four-dimensional output vector OR_(x), OTO_(x), OTI_(x) and OTX_(x) for each of the properties of rhythm, tonality, timbre and texture (the latter of which requires different processing, as will be explained below).

Referring again to FIG. 2, a mel-spectrum 500 is generated for each one of the selected pairs of files (in this exemplary case digital audio tracks) 302, 304. This process is well understood by the skilled addressee. Both tracks are firstly subjected to processing within a convolutional neural network “CNN” 320, with individual vector outputs for each track then subjected to processing and interpretation with an assigned ANN (NN_(TX) 316) for texture evaluation. NN_(TX) 316 is therefore in parallel with the other neural networks responsible for evaluation and embedding of vectors for rhythm, tonality and timbre. Respective vector outputs OTX₁, OTX₂ for tracks 1 and 2 from NN_(TX) 316 are, in a preferred form, also sixty-four dimensional vectors, with each of these outputs then concatenated or otherwise assembled with the three other vectors for each track (labelled OR_(x), OTO_(x), OTI_(x)) to produce a two-hundred and fifty-six dimensional vector for each of tracks 1 and 2. This two-hundred and fifty-six dimensional vector (again, the precise length is a design option as indicated above) is the aforementioned “second vector in Euclidean space”.

System intelligence includes a comparator 330 that functions to evaluate distance measures in property space (arising between the assembled composite second vectors for each of the paired tracks as assembled from the four outputs OR_(x), OTO_(x), OTI_(x) and OTX_(x)) with corresponding distance measures in semantic space. The system intelligence thus establishes an association between the two spaces. As an example of how the system operates to compare distances between vectors, the system intelligence may utilise a squared-absolute distance calculation.
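By way of non-limiting illustration only, the assembly of the four sixty-four-dimensional property outputs into a 256-dimensional second vector, and a distance evaluation between two such vectors, may be sketched as follows. The sketch assumes that the squared-absolute distance is computed as the sum of squared absolute component differences; the function names and the constant-valued vectors are hypothetical.

```python
def assemble(o_r, o_to, o_ti, o_tx):
    """Concatenate the four per-property output vectors into one property vector."""
    return list(o_r) + list(o_to) + list(o_ti) + list(o_tx)

def squared_distance(u, v):
    """Sum of squared absolute differences between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** 2 for a, b in zip(u, v))

# Hypothetical constant-valued 64-dimensional outputs for tracks 1 and 2;
# the two tracks differ only in their rhythm output.
track1 = assemble([0.0] * 64, [0.0] * 64, [0.0] * 64, [0.0] * 64)
track2 = assemble([0.5] * 64, [0.0] * 64, [0.0] * 64, [0.0] * 64)
property_distance = squared_distance(track1, track2)
```

Here the two assembled vectors differ in sixty-four components by 0.5 each, so the distance evaluates to 64 × 0.25 = 16.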

The system intelligence then functions to contrast the first vector and second vector with an operative view to have the second vector aligned with the closeness assessment of the first vector. In other words, the system intelligence contrasts the semantic distance (based on textual evaluation) with a property distance. Putting this differently, the first vector in semantic space (based on the human descriptions of source files) is used to assess and correct the second vector (associated with extracted measurable properties of the content) in property space, thereby allowing convergence, through changing of weights in the ANN, of the output of the secondary neural network to the semantic result of the first neural network. The objective is that the re-combined concatenated output [and, particularly, the evaluated Euclidean property vector relating to differences 330 between training tracks] is also represented on a scale of zero to one, and neural network weights in each of the ANNs for rhythm “NN_(R)” 310, tonality NN_(TO) 312, timbre NN_(TI) 314 and musical texture NN_(TX) 318 are adjusted so that the Euclidean property distance measure 330 tends to, i.e. preferably replicates, the semantic quantized distance. Other scaling may be applied rather than hard levels in a quantization approach.

Particularly, the weight factors applied in each of the ANNs for rhythm “NN_(R)” 310, tonality NN_(TO) 312, timbre NN_(TI) 314 and musical texture NN_(TX) 318 are adjusted by an understood process of backpropagation so that the result of the Euclidean property distance measure 330 between comparative pairwise tracks/files tends towards (and ideally eventually correlates with a high degree of accuracy to) the distance measures in semantic space. As will be understood, the process of backpropagation therefore trains each neural network by adjusting applied weights based on contrasting objectively measurable signal attributes used to define identifiable file properties.
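By way of non-limiting illustration only, the adjustment of weights so that a property distance tends towards a quantized semantic distance may be sketched as follows. The sketch is heavily simplified and rests on several assumptions that are not taken from the described embodiment: a single shared linear layer stands in for the parallel ANNs, the loss is the squared difference between the property distance and the target, plain gradient descent is used, and all names and numeric values are hypothetical.

```python
def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def property_distance(w, x1, x2):
    """Squared Euclidean distance between the embeddings of x1 and x2."""
    e1, e2 = matvec(w, x1), matvec(w, x2)
    return sum((a - b) ** 2 for a, b in zip(e1, e2))

def train_step(w, x1, x2, target, lr):
    """One gradient-descent update of w on loss = (distance - target)**2."""
    u = [a - b for a, b in zip(x1, x2)]  # difference of the two inputs
    v = matvec(w, u)                     # difference of the two embeddings
    d = sum(c * c for c in v)            # current property distance
    # d(loss)/d(w[i][j]) = 2 * (d - target) * 2 * v[i] * u[j]
    scale = 4.0 * (d - target)
    return [[wij - lr * scale * v[i] * u[j] for j, wij in enumerate(row)]
            for i, row in enumerate(w)]

# Hypothetical values: three input attributes and a two-dimensional embedding.
w = [[0.1, 0.2, -0.1], [0.0, 0.3, 0.1]]
x1, x2 = [1.0, 0.0, 2.0], [0.0, 1.0, 0.0]
target = 1.0  # this pair is labelled dissimilar in semantic space
before = property_distance(w, x1, x2)
for _ in range(200):
    w = train_step(w, x1, x2, target, lr=0.01)
after = property_distance(w, x1, x2)
```

Because both inputs pass through the same weights, a single update moves the embeddings of both tracks at once; in a multi-layer network the same gradient would be obtained by backpropagation rather than written out by hand.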

The effect of evaluating two independent paths (the first processed entirely in semantic space and the second pushed into measured property space based on measurable qualities of subjectively-assessed properties) produces an emotionally-perceptive system that more closely aligns with human perception of either closeness or dissimilarity. The effect, in the exemplary context of finding tracks between different genres of music, is that quantitatively more as well as qualitatively better associations are made between different tracks even when those tracks may, upon initial inspection, objectively appear to be in radically distinct and unrelated music genres. This represents a step forward in addressing problems such as cold start in providing an improved and reliable recommendation tool that can push relevant content to new or existing users. In fact, the process and system's architecture are emotionally perceptive to the extent that they permit language independent embedding of semantic meaning. This means that, for example, Chinese and English may be overlaid without affecting semantic interpretation or the results.

As a further component to the assessment of semantic properties of an audio work in objective Euclidean space, a mel-spectrograph is processed through a convolutional neural network “CNN” to produce a vector component representative of a subjective but complementary concept of musical “texture”.

FIG. 5 is illustrative of convolutional and pooling layers within an artificial neural network assigned to mel-spectrum interpretation and, particularly, the deep learning needed to identify important musical patterns and trends in the tracks under assessment. Convolutional processing addresses the two-dimensional nature of the spectral input matrix 600.

As indicated, the mel-spectrum includes time-varying patterns that reflect texture that serves as a further component for similarity/dissimilarity assessment of properties in property space. In order to identify these textural trends in a 2-dimensional mel-spectrogram, filters in the convolutional neural network are trained to identify patterns within the mel-spectrogram and, particularly, to identify optimized parameter values within each of these filters that generate filter outputs that reflect a high degree of confidence in the identification of patterns/trends in the input matrix. As such, parameters within each filter will be adjusted, as will be understood by the nature of operation of ANNs, to permit each filter to detect a particular input that is relevant to desirable subjective properties, e.g. rhythmic and/or melodic patterns, contained within the mel-spectrum of the tracks under investigation.

In this regard, the chain of processing in the ANN for texture includes sequential convolutional layers. For example, layers 1, 3 and 5 may be implemented as convolutional layers respectively with 128, 128 and 64 neurons and with each filter having a kernel size [i.e. the size of the filter matrix] of three (3). During training, on a stepwise basis across the spectral input matrix 600, a filter 602 [having an initially untrained and then a revised set of parameters] is advanced. By applying the filter 602 to input data, an output matrix 604 yields positive match results between input values in the overlaid matrix. By way of a simplistic example:

In an iterative stage, the values of the parameters in the filter are then altered and the 2-D input is re-run to determine whether the new filter coefficients yield a better or inferior result for matches for the same input data, e.g.

In progressing through all possible filter positions in the 2D input data, a further results matrix 604 of positive yield results is developed; this is representative of the ANN trying to optimise filter coefficients/parameters to maximize matches. In FIG. 5, the results matrix identifies that higher correlation with the filter 602 (and therefore a high match and higher likelihood of identification of an interesting pattern in the input data) is experienced with values of four (4) relative to poorer matches indicated by zeros and ones.
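By way of non-limiting illustration only, the stepwise advancement of a filter across a 2D input, and the effect of revising the filter parameters, may be sketched as follows. The input values, the two 2×2 filters and the function name are hypothetical and are not the values shown in FIG. 5; a 2×2 filter is used in place of a kernel of size three purely to keep the sketch short, and a stride of one without padding is assumed.

```python
def slide_filter(matrix, kernel):
    """Slide kernel over matrix (stride 1, no padding) and return the matrix of
    summed element-wise products, one value per filter position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(matrix) - kh + 1
    out_w = len(matrix[0]) - kw + 1
    return [[sum(matrix[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

# Hypothetical binary 4x4 input with a diagonal pattern.
spectral_input = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
initial_filter = [[1, 0], [0, 1]]  # responds to a falling diagonal
revised_filter = [[0, 1], [1, 0]]  # responds to a rising diagonal

initial_results = slide_filter(spectral_input, initial_filter)
revised_results = slide_filter(spectral_input, revised_filter)
```

Higher values in a results matrix mark filter positions at which the filter coincides with more of the underlying pattern; comparing the two results matrices shows how a change of filter parameters changes where matches are found.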

As with any CNN, with more filters one can identify more patterns, but this comes at the expense of requiring more parameters and a need for more training data.

Preferably, for reasons of expediency, each convolution is followed by a max pooling layer having a suitable kernel size, such as a 2×2 matrix/kernel. The effect of the max-pooling approach is shown in the lower part of FIG. 5 in which a results matrix 606 is decimated to generate a new smaller input matrix to be processed in the successive convolutional phase. As will be understood, max pooling looks at a block of outputs and then rejects all but the highest value in the analysed block on the presumption that lower values are statistically not relevant in subsequent processing. In FIG. 5, applying a 2×2 max pooling approach to a 4×4 input matrix from the preceding convolution stage yields four independent blocks, with each of those blocks containing four (yield) values. The max pooling result is then a first 2×2 max-pooled matrix 608 in which only the highest yield values are retained. This first 2×2 max-pooled matrix 608 is then input into a successive convolutional layer. Consequently, max pooling reduces the operative size of the matrix to reduce dimensionality over different (successive) layers of the ANN.
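By way of non-limiting illustration only, the application of 2×2 max pooling to a 4×4 results matrix may be sketched as follows. The matrix values and the function name are hypothetical and are not the values shown in FIG. 5.

```python
def max_pool(matrix, size=2):
    """Keep only the highest value in each non-overlapping size x size block."""
    rows, cols = len(matrix), len(matrix[0])
    if rows % size or cols % size:
        raise ValueError("matrix dimensions must be multiples of the pool size")
    return [[max(matrix[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, cols, size)]
            for r in range(0, rows, size)]

# Hypothetical 4x4 results matrix from a preceding convolution stage.
results = [
    [1, 0, 2, 3],
    [4, 1, 0, 1],
    [0, 2, 4, 0],
    [1, 1, 0, 2],
]
pooled = max_pool(results)  # four 2x2 blocks reduce to one 2x2 matrix
```

The sixteen input values are reduced to four, one per block, which is the reduction in operative matrix size referred to above.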

The use of the max-pooling approach increases computational efficiency since, with each neuron introducing a parameter that requires learning, restriction of the input matrix size reduces the amount of data (that otherwise is required to mitigate inappropriate granularity and inaccuracy in calculating parameters/weights).

The CNN therefore includes several convolutional layers typically interspersed by a max-pooling layer.

An output of the last max-pooled layer is flattened, i.e. all matrix columns are concatenated to form a single vector which acts as the input to the dedicated neural network for texture assessment, i.e. musical texture NN_(TX) 318.

Before discussing the general form and operation of the ANNs shown especially in the context of FIG. 6, it is noted that the flattened output from the CNN 320 is applied, as (for example) a sixty-four dimensional vector, to the input of a two-hundred and fifty-six neuron hidden layer of the dedicated texture neural network NN_(TX) 318, preferably with a rectified linear unit (“ReLU”) activation function for optimized deep learning. The texture neural network NN_(TX) 318 provides, at its output, a sixty-four-dimensional vector (in the form of an activated linear function) representing each of the mel-spectral components OTX₁, OTX₂, with these vectors OTX₁, OTX₂ assembled with the other output vectors representing each file's evaluated properties, i.e. tonality, timbre and rhythm. The resulting 256-dimensional vectors for each of the two pairwise files are then made the subject of the distance evaluation in Euclidean space, as indicated above and represented in FIG. 2.

The initial/upper convolution layers of the CNN function to identify filter weighting to be applied across neural nodes in order to define useable parametric functions that allow identification of these patterns of interest in the mel-spectrum [that is the input in the CNN]. Values for the parameters 612-620 of the filter matrix are thus learnt by iteration and backpropagation that tests the viability of alternative values to optimize an output, with optimization developed during successive passes across the source input data and varying source inputs of the training set.

FIG. 6 is a representation of an artificial neural network 700 employed within the various ANN property-processing chains of FIG. 2.

Each of the ANNs for rhythm “NN_(R)” 310, tonality NN_(TO) 312, timbre NN_(TI) 314 and musical texture (post convolutional processing) NN_(TX) 318 includes a multi-neuron input layer or level 702 followed by at least one and usually a plurality (1^(st) to k^(th)) of hidden neuron layers that contain at least the same number of individual neurons 704-718 as the multi-neuron input layer or level 702. The k^(th) hidden layer provides an output level 720, with the number of neurons in the output generally less than the number of neurons in the preceding k^(th) hidden level.

In terms of basic neuron mapping, an output from each neuron (such as in the first input layer) is mapped on a many-to-many basis as inputs into each neuron in the immediately following (e.g. 1^(st) hidden) layer. The k^(th) hidden layer, i.e. the penultimate layer of each ANN, maps multiple inputs to each of its outputs (O₁ to O_(m)) on a many-to-one basis such that the output O₁ to O_(m) is a linear function (such as described at https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6).

Each quantized signal representation extracted for each identified property (in the case of rhythm, tonality and timbre) or the flattened output from the CNN function (for texture) is provided as an input (i₁ to i_(n)) to one of the neurons of the input layer 702.

Taking neuron 712 as an example, it can be seen in FIG. 6 (left side, boxed representation) that the neuron receives a plurality of weighted inputs w_(i,1), w_(i,2), w_(i,3), w_(i,r) that are summed together in a summing function 730. The summing function, in fact, includes a secondary bias input b_(i) which is generally just a learned constant for each neuron in each layer. It is the weights w_(i) and the bias b_(i) that the processing intelligence estimates and then revises through a backpropagation process that takes the pairwise Euclidean property distance measure 330 as the influencing factor and, particularly, how this assimilates/maps to the corresponding pairwise target distance in semantic space. An output a_(i) from the summing function 730 is subjected to a non-linear activation function f (reference number 734). The output of the neuron y_(i) is propagated to the next layer.
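By way of non-limiting illustration only, the operation of a single neuron (weighted summation, addition of the bias and application of the activation function f) may be sketched as follows. The use of ReLU as the activation function is one example only, and the function names and numeric values are hypothetical.

```python
def relu(a):
    """Rectified linear unit: passes positive values and clamps negatives to zero."""
    return max(0.0, a)

def neuron_output(inputs, weights, bias, activation=relu):
    """Compute y = f(a), where a = sum over j of (weight_j * input_j) + bias."""
    a = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(a)

# Hypothetical inputs and learned parameters for one neuron.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 0.25]
y_low_bias = neuron_output(inputs, weights, bias=0.5)   # a = -0.25, so y = 0.0
y_high_bias = neuron_output(inputs, weights, bias=2.0)  # a = 1.25, so y = 1.25
```

The two calls differ only in the bias, which shows how the learned constant can move the summed value across the threshold of the activation function.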

In the exemplary sense of pairwise audio data signal comparison, the input i₁ to i_(n) may be derived from the Essentia feature set as identified above in relation to timbre, tonality and rhythm, whilst the CNN mel spectrum provides the neuron input for the texture-dedicated artificial neural network NN_(TX). The final outputs o₁ to o_(m) form the 64-dimensional embedding vector for each particular property, e.g. timbre OTI₁ and texture OTX₂.

With respect to a preferred implementation for FIG. 6, there are at least two hidden layers. The first hidden layer contains five hundred and twelve (512) neurons. The second hidden layer contains one thousand and twenty-four (1024) neurons. The activation function in both of these hidden layers is, preferably, the ReLU function, such as described at https://en.wikipedia.org/wiki/Rectifier_(neural_networks).
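By way of non-limiting illustration only, a forward pass through a property network having a 512-neuron hidden layer, a 1024-neuron hidden layer (both with ReLU) and a 64-dimensional linear output may be sketched as follows. The nineteen-component rhythm input is used as the example. The random, untrained weights, the initialisation scale, the zero biases and all function names are assumptions made for the purpose of the sketch; in practice the weights and biases would be learnt by backpropagation.

```python
import random

def relu(a):
    return max(0.0, a)

def dense(inputs, weights, biases, activation=None):
    """Fully connected layer: every input feeds every neuron of the layer."""
    outputs = []
    for row, b in zip(weights, biases):
        a = sum(w * x for w, x in zip(row, inputs)) + b
        outputs.append(activation(a) if activation else a)
    return outputs

def init_layer(n_in, n_out, rng):
    """Random untrained weights (assumed scale) and zero biases."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_network(n_inputs, rng, hidden=(512, 1024), n_outputs=64):
    sizes = [n_inputs, *hidden, n_outputs]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def forward(network, inputs):
    """ReLU on the hidden layers, linear (no activation) on the output layer."""
    x = inputs
    for idx, (weights, biases) in enumerate(network):
        is_output = idx == len(network) - 1
        x = dense(x, weights, biases, None if is_output else relu)
    return x

rng = random.Random(0)
rhythm_net = build_network(19, rng)            # nineteen rhythm attributes in
rhythm_input = [rng.random() for _ in range(19)]
embedding = forward(rhythm_net, rhythm_input)  # 64-dimensional output vector
```

Networks for tonality and timbre would differ only in the input size passed to build_network, i.e. thirty-three and seventy-five respectively.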

Referring in detail now to FIG. 3, the training process by which the system of FIG. 2 is trained is set out in general terms.

From a universal training set of audio tracks (or a selected subset of pairs), a pair of tracks for semantic and musical property comparison is selected 402. Both tracks are then subjected to feature extraction 404 to identify properties, e.g. multiple sets of measurable descriptors that can be used to define rhythm, etc. Texture, as indicated above, follows a modified process given the nature of the mel spectrum. For each pair, the properties are commonly processed by the system intelligence to train the network and refine the weights and bias values applied 406 in each of the parallel artificial neural networks for rhythm “NN_(R)” 310, tonality NN_(TO) 312, timbre NN_(TI) 314 and musical texture NN_(TX) 318. Regardless of whether ANN processing involved a CNN or not, each of the multiple parallel neural networks operates to contribute 408 an embedded vectorial output 350, 352 [assembled from contributing vectors OR_(x), OTO_(x), OTI_(x) and OTX_(x)] in (typically Euclidean) property space for each of the pair of files under consideration. An assessment/determination 410 of a Euclidean property distance between the vectorial outputs 350, 352 for each of the files is then undertaken. The determined Euclidean distance, calculated by the neural networks, is then mapped/contrasted with the semantic distance (in semantic space) between the same files (as described in relation to FIG. 1).
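The assembly of the vectorial output from the four contributing vectors, and the determination of the Euclidean property distance between two such outputs, may be sketched as follows. The function names and the concatenation order (rhythm, tonality, timbre, texture) are assumptions for illustration.

```python
import numpy as np

def file_vector(o_r, o_to, o_ti, o_tx):
    """Concatenate the per-property outputs into one vectorial output."""
    return np.concatenate([o_r, o_to, o_ti, o_tx])

def property_distance(v_a, v_b):
    """Euclidean distance between two vectorial outputs in property space."""
    return float(np.linalg.norm(v_a - v_b))
```

With four 64-dimensional contributions, each vectorial output in this sketch has 256 dimensions.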

If it is assessed 418 that there is general numerical correspondence 416 between the property distance and the quantized semantic distance (which is unlikely for initial weights and bias values at the outset of training with the first few tens/hundreds of pairwise comparisons), then a determination may be made as to whether the weights and biases in the contributing ANNs satisfy an agreeable rule. This may permit the cutting short of ANN training without exhausting all pairwise comparative options, although optimization in each NN will be improved with an ever-increasing number of pairwise assessments and weight and bias revisions.

From a practical perspective, the system is typically arranged to undertake several runs or “epochs” through the entire training set. Training can be halted when (a) the training loss does not improve over several epochs, or (b) the validation loss (on unseen data) does not improve. It is noted, also, that if the training loss improves but the validation loss does not, then this is indicative of overfitting.
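The halting criteria above may be sketched as a simple check applied to the per-epoch loss history. The function names and the patience value of three epochs are assumptions; the description states only that training may be halted when the loss does not improve over several epochs.

```python
def should_stop(losses, patience=3):
    """True when the loss has not improved during the last `patience` epochs."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    return min(losses[-patience:]) >= best_before

def looks_overfit(train_losses, val_losses, patience=3):
    """Training loss still improving while validation loss has stalled."""
    return should_stop(val_losses, patience) and not should_stop(train_losses, patience)
```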

At the outset of training, however, there will likely be significant differences and a requirement for refinement of ANN operation in terms of parameter identification using refined filter weights w_(i) and bias b_(i) values. This is achieved through use of the entire universe of training data to optimise ANN performance. Consequently, the training process replicates the path of pairwise assessment for all members in the training set. This is represented by decision block 414 and negative or affirmative paths therefrom.

If there is repeated close correspondence (affirmative path) between the quantized semantic distance and the (typically-used) property distance obtained from the vectorial outputs 350, 352 for file after file, then optimization of the weights and biases may be assumed to have been achieved (at least to an appreciable and acceptable extent).

Returning to the path (i.e. negative outcome 420) where significant numeric discrepancies exist between the distance measures in semantic and property spaces, then filter parameters and, particularly, applied weights and bias in one or more of the neural networks need to be adjusted. The objective in this adjustment is to realise a numerical convergence between vectorial distance dissimilarity measures in property space and associated, i.e. corresponding, distance dissimilarity measures in semantic space. It is noted that, in this respect, the values in property space will invariably vary from the hard values of zero and one in semantic distance space because perceptual differences and absolute differences exist between dissimilar pairs of tracks (even if the compared tracks are cover versions of the same song). Checking for loss or overfitting after each epoch is a typical approach.

The processing intelligence in the system therefore adjusts 422 weights and biases through backpropagation to seek convergence between semantic and property (numerically-based) distances. These adjusted weights are then applied to the neurons in the various neural networks, as shown in FIG. 2, in order to improve the alignment for a next pair of files in the training set.
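The principle of adjusting weights so that a pairwise property distance converges towards a target semantic distance can be illustrated with a deliberately simplified sketch. Here a single linear map w stands in for the multi-layer networks, the loss is assumed to be the squared difference between the property distance and the target distance, and the gradient is written out by hand; none of these simplifications is stated in the description.

```python
import numpy as np

def pair_loss(w, x_a, x_b, target):
    """Squared gap between property distance and target semantic distance."""
    r = np.linalg.norm(w @ (x_a - x_b))
    return float((r - target) ** 2)

def gradient_step(w, x_a, x_b, target, lr=0.05):
    """One gradient-descent update of w for a single pair of files."""
    u = x_a - x_b
    d = w @ u
    r = np.linalg.norm(d)
    if r == 0.0:
        return w  # gradient undefined at zero distance; skip this pair
    grad = 2.0 * (r - target) * np.outer(d / r, u)
    return w - lr * grad
```

Repeating the update for a dissimilar pair (target distance of one) moves the property distance for that pair progressively closer to the target.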

The training of the ANNs yields distance values in property distance space that reflect track dissimilarities on a pairwise comparative basis. Consequently, once trained, any distance in property distance space maps accurately and reliably to actual perceivable differences in semantic space. The changing of weights and biases in the neurons of the ANNs is the transformative function or mechanism by which the property space is mapped into abstract semantic space.

Once the training set has been exhausted, the neural networks are assessed to have been optimised. This is reflected by the affirmative path 424 from decision block 414.

As will be understood, each processing chain for each extracted property is a machine. In the present exemplary case of audio evaluation, there are four machines: one each for rhythm, tonality, timbre and texture. In order to optimise the training process, it has been appreciated that the independent machines each make an independent, de-coupled contribution to the final vectorial representation 350, 352 in property space. Consequently, a preferred approach, on a pairwise assessment basis relative to the semantic evaluation in semantic space, is to adopt a weighting of importance between each of these effectively parallel individual machines. In other words, the training process determines a relative importance between particular audio descriptors (associated with each property) within each input to the respective ANN. This means that each machine learns which of the specific contributing extracted measurable values has the greatest impact in altering a final result that reflects the desired human subjective assessment (in semantic space). To achieve this, the system operates to feed two tracks into each machine. Each machine is then configured to identify similarities or dissimilarities between the set of quantized representations used to define each property being evaluated by the specific machine. The machine, in adjusting its biases and weighting factors in the backpropagation process, operates to downplay, i.e. reduce the relative significance of, the property (e.g. rhythm) if there's dissimilarity (in property distance space) with the corresponding property being, in a preferred embodiment, simultaneously evaluated in the specific pairwise comparison in semantic space. In other words, identified dissimilarity does not contribute to generating a set of biases and weights that brings about better alignment with the semantic assessment and semantic differences between evaluated pairwise audio tracks in semantic space. As such, across each machine, the system intelligence weights implicitly the other properties (in both tracks) in particular machines since these other properties are assessed to have a greater impact on aligning with the semantic assessment, i.e. rhythm vectorial components OR_(x) may be assessed by the system to have a greater contribution to human perception of the qualities of the audio content relative to the tonality vectorial components OTO_(x). Indeed, extending this principle to individual quantization representations, machine-identified dissimilarity between individual quantized representations (such as barkbands_crest values that contribute in Essentia to the property timbre) in comparative pairwise tracks means that such individual quantized representations are of less significance in aligning property-based vectors to the semantically-based values.

It will be appreciated that the accuracy of a resulting transformative function of the neural network is dictated by the robustness of the training data and particularly the size of the matrix. Thus, whilst ten thousand audio files might be assessed to generate correspondingly ten thousand vectors, it is perceived that significantly fewer or significantly more can be critiqued by NLP to provide the embedding.

To build a comparative library, it is now necessary for each of the files in the training set to simply be processed 426, on a non-comparative basis, through the ANNs to generate a Euclidean vector for that track. This vector can then be stored 430 in a database as a value cross-referenced to a file name, e.g. a song title and artist or other form of identifier. Since the vector is composed of distinct components attributable to particular file properties, the vector can itself be parsed to permit searching for a particular identified property. For example, if commonality in rhythm is an over-riding requirement, then any numerical closeness between source and reference files in this particular contributing (in the preferred but exemplary case) sixty-four-dimensional output OR_(x) is determinative of semantic closeness in rhythm.
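Parsing a stored vector to search on a single property may be sketched as follows. The sketch assumes that the library is a mapping from file identifier to a 256-dimensional vector and that the four sixty-four-dimensional contributions are concatenated in the order rhythm, tonality, timbre, texture; the names SLICES, property_component and closest_by_property are hypothetical.

```python
import numpy as np

# Assumed layout of the concatenated vector (order is an assumption)
SLICES = {
    "rhythm": slice(0, 64),
    "tonality": slice(64, 128),
    "timbre": slice(128, 192),
    "texture": slice(192, 256),
}

def property_component(vector, name):
    """Extract the contribution of one property from a file vector."""
    return vector[SLICES[name]]

def closest_by_property(query_vector, library, name):
    """Identifier of the library file closest to the query in one property."""
    q = property_component(query_vector, name)
    return min(
        library,
        key=lambda fid: np.linalg.norm(property_component(library[fid], name) - q),
    )
```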

In other words, when the individual artificial neural networks for rhythm “NN_(R)” 310, tonality NN_(TO) 312, timbre NN_(TI) 314 and musical texture NN_(TX) 318 have been optimised, the measurable properties of an (exemplary) audio track are reliably reflected in a multi-dimensional vector generated by processing a sample (e.g. partial or entire song) of the audio track through the various NN having set optimised weights and biases.

Consequently, based on an absolute value scale, actual perceivable dissimilarities or similarities can be assessed for track against track, including new tracks that were not used in the training data set. At this point, the semantic distances used for training can therefore be ignored because semantic space has now been mapped to an absolute scale where close numeric values accurately represent contextual similarity, whereas large numeric distances represent user-discernible dissimilarity.

FIG. 7 is a flow process 800 employed by a preferred embodiment to assess a measure of emotionally-perceptive file dissimilarity, especially in the context of an audio file.

Once the neural network of FIG. 2 has been trained, an audio track (or the appropriate category of file) is selected 802. The selection is typically by a user, such as an owner of or subscriber to a music library or service. Alternatively, selection may be in the form of an upload of a piece of music or file, including an original composition. The selected or uploaded “first” audio file is then processed to obtain feature extraction 804 of identifiable properties, such as tonality, etc. The neural network of FIG. 2 then processes 806 the extracted features using the optimised weights and biases to generate 808 a first file vector V_(FILE) (in Euclidean property space or some other appropriate property space) representative of a plurality of user-discernible or user-selectable, system measurable properties of that particular file. Referencing 810 the file vector V_(FILE) for the first audio file into a library that is indexed by both file identifiers and associated file vectors (for those other files) permits those library-based files to be listed 812 in a descending order of semantic similarity to the first audio file. This can be achieved with or supplemented by the use of kNN analysis.
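The listing of library files in descending order of semantic similarity (i.e. ascending vector distance) may be sketched as a brute-force nearest-neighbour search. The function name rank_library and the dictionary-based library are assumptions for illustration; a practical system may instead use an indexed kNN structure.

```python
import numpy as np

def rank_library(query_vector, library, k=None):
    """Return file identifiers ordered from closest to most distant.

    `library` maps file identifier -> file vector; `k` limits the list length.
    """
    scored = sorted(
        (float(np.linalg.norm(vec - query_vector)), fid)
        for fid, vec in library.items()
    )
    return [fid for _, fid in scored[:k]]
```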

FIG. 8 is a system or network architecture 900, including an accessible database 902 containing vector representations reflecting file similarity/dissimilarity measures according to aspects of the present invention.

Typically, a network (such as the internet) 902 permits communications to be passed between devices, such as a server 904, a home computer 906 and a smartphone 908. These three categories of device are not limiting but indicative of both processing intelligence within, and access points of/into, the system 900. The server 904 typically supports the artificial neural network 905 described above especially in relation to FIGS. 2 and 6. The system intelligence may, however, be more distributed, including being cloud-based or distributed between a plurality of interconnected servers. For the sake of clarity only, system intelligence is simply shown as a block within the server, although it will be readily appreciated that computing power is also within the smartphone and computer. The server, as with other interacting units, will include general control firmware and software 914, e.g. to support web-based access and/or to control registration of users to services administered by the server or other service provider 912 and/or to support communications protocols. The server may regulate access and information loaded into or extracted from a source database 306 coupled to the server, e.g. via a LAN or WAN. This access may be by the computer 906, smartphone 908 or the like.

The source database may, in fact, be an existing library of files, such as a catalogue of audio files. Files in the source database may, therefore, over time be extracted by the server and processed to produce cross-referencing between file identities (such as track name and artist) 920 and generated Euclidean vector measures (V_(FILE)) 922 representative of file properties aligned with emotionally-perceived semantic qualities.

The provision of a user interface 930, such as a touchscreen of a graphic user interface “GUI” on, for example, a smartphone provides access to a searching tool software application that permits searching for tracks sharing close semantic properties according to the invention. The software may be local or otherwise accessed through a web browser allowing interaction with the server 904, databases 306 or service providers (such as social media companies having access to content). Alternatively, the software may be hosted as a web-based service. Preferably, the GUI 930 offers the user a number of “soft” slider controls that relate to selectable properties or listening/searching preferences, e.g. a first slider may relate to rhythm. The slider positions can therefore be altered, by the user, to reflect search parameters that correlate to individual contributing multi-dimensional vectors OR_(x), OTO_(x), OTI_(x) and OTX_(x) in the final embedded vectorial output 350, 352. Setting the sliders on the GUI therefore targets specific vectorial aspects in processed tracks 920 stored within the system.
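One way in which slider positions could be applied to the contributing vectors is sketched below, with each slider acting as a weight on the squared distance contributed by its property. This weighting scheme, the concatenation order and the names PROPERTY_ORDER and slider_distance are assumptions; the description states only that the sliders correlate to the individual contributing vectors.

```python
import numpy as np

# Assumed concatenation order of the contributing vectors
PROPERTY_ORDER = ("rhythm", "tonality", "timbre", "texture")

def slider_distance(v_a, v_b, sliders, dims=64):
    """Distance between two file vectors with per-property slider weights.

    `sliders` maps property name -> non-negative weight (default 1.0).
    """
    total = 0.0
    for idx, name in enumerate(PROPERTY_ORDER):
        part = slice(idx * dims, (idx + 1) * dims)
        weight = sliders.get(name, 1.0)
        total += weight * float(np.sum((v_a[part] - v_b[part]) ** 2))
    return total ** 0.5
```

Setting a slider to zero in this sketch removes the corresponding property from the comparison altogether.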

FIG. 9 illustrates two exemplary embeddings in the context of exemplary video file assessments based on selected extracted properties that define the file. A first embedding (“embedding 1”) 950, shown at the top of the figure, has an exemplary multi-dimensional vector “101-1101-11101101—” assembled from concatenated contributions for measured properties for colour 952, presence of an object 954, texture 956 and OTHER 958 that are extracted from a first data file. The term “OTHER” has been used to indicate that other properties may be measured and the multi-dimensional vector extended in bit length. A second embedding (“embedding 2”) 960, shown at the bottom of the figure, has an exemplarily generated multi-dimensional vector “111-11101-10001111—” assembled from concatenated contributions for extracted measured properties that correspond to colour, presence of an object, texture and OTHER properties for a second different data file. Further, as indicated above, individual quantized dimensions may have, in terms of the overall semantic representation, greater or lesser significance relative to other quantized dimensions in categorizing each file for a particular user. For example, the property colour [in the context of an exemplary video scenario] may be less significant to a user's searching objective or user categorization of that file [from the perspective of relative importance] than the property “presence of an object” [which suggests fast moving evolution]. Also, as indicated, there is no reason for the individual vectors to be of a common length.

As will be understood from the foregoing, each file vector is an expression concerning semantic qualities but expressed in multi-dimensional property space. Each file vector V_(FILE) is thus a semantic representation.

Turning now to FIGS. 10 to 15, these figures illustrate use of embedded vectors (or “embeddings”) in terms of system interactions with a user and deliverables by differently arranged systems having different technical objectives. FIG. 10 is, in fact, closely aligned with and representative of the process shown in FIG. 7.

In all cases, each embedding is a file vector V_(FILE) derived by the process of FIG. 7 and associated with the data file, and thus reflects a numeric evaluation of semantic perception of dissimilarity/similarity measures for that particular data file as obtained from extracted file properties for that particular data file. The multi-dimensional vector that is the embedding is obtained from processing of the supplied file in the trained neural network of FIGS. 1 to 6 and FIG. 8. As indicated herein, the vector V_(FILE) associated with a data file is a semantic representation of a plurality of user-discernible or user-selectable, system measurable properties of that particular file. The association, although preferably captured in file metadata, can be looser and just a suitably accessible cross reference in a table or link into a database. Direct incorporation of the file vector V_(FILE) into the metadata is however preferred.

In certain cases, an initial query from a user need not contain the file vector V_(FILE). Rather, the initial query may take a variety of forms, such as a textual description or more likely a virgin, i.e. unprocessed, raw data file. Of course, the query could also take the form of a file that already contains the file vector V_(FILE) (as generated at step 808 of FIG. 7).

Any virgin file can be either processed locally at a user device (the “client side”, such as a smartphone or computer), or remotely on the server side at a server or other intermediate processing entity accessed via the network 902 (of FIG. 8). For local processing, the local device downloads an adapted version of the training process/trained network to permit local evaluation, although the user device could interact with a specific site and make use of JavaScript or the like. The common thread is that processing of the virgin file generates a file vector V_(FILE) in measurable property space, and that processing intelligence in the system (whether this is at the user device, the server or elsewhere) functions to extract measurable properties from the virgin data file and applies those extracted properties through a trained neural network to generate a realistic semantic assessment in property space for the virgin data file.

For the sake of explanation only for FIG. 10, the query 1000 from a user 1002 is assumed to take the exemplary form of a virgin, i.e. previously non-evaluated, music data file 1004 (although the file could equally be a virgin video clip/file, speech or a virgin text file). The virgin data file is processed, somewhere at an appropriate processing point in the system, in an ANN 1006 that has been trained to provide a similarity/dissimilarity file vector V_(FILE). This vector is associated or embedded with the virgin file which can therefore be contextually placed within a database 1026 of reference vectors 1010 (and their associated user-consumable relevant data) and thus surrounded, in a semantically relevant cluster 1012 of user consumer data, by one or more relevant candidate files 1014-1022. By comparing and contrasting the respective reference vectors of pre-stored candidate files in the database relative to the newly created vector for the query 1028, relative numerical ordering of semantic distance d [based on absolute measures between respective vectors] of those candidate target files to the query can be established and the k most similar/closest target files 1030-1034 in the database identified and communicated 1030 to the user 1002. Ordering of the growing number [assuming that the query, associated data and vector is stored in the database by the system] of reference data files in the database according to increasing numerically-evaluated vector distance variations thus reflects user-perceptive similarity/dissimilarity between all files in the database.

The k closest identified candidate files therefore represent an improved recommendation process that is based on a numeric characterization of data in which the evaluation is based on a process that reduces the semantic gap by improving alignment between real-world perception and suggested artificial reality.

The reference files/reference vectors in the database 1026 may be further partitioned or rearranged. The semantic similarities/dissimilarities may be reflected by assessed quantified differences over the entire concatenated lengths between respective file vectors V_(FILE) but may also be assessed over one or more selected portions of the concatenated vectors that are reflective of one or more specific extracted property/properties of higher user-selected relevance, e.g. colour may be more important to a user than the presence of sound, or age of a file may be of importance.

It is noted that, whilst shown remote to the user 1002 and described as accessed via a server, the reference database may be local to the user, e.g. stored in a local hard drive on a user's computer.

In FIG. 10, the application therefore assumes that relevance of recommendations is only determined by semantic similarity relative to a query item, e.g. a media file, provided either by the user or else the media item last consumed by the user. This approach holds in application scenarios where other user and media item information is irrelevant, e.g. in the case where a music industry professional is searching for a royalty-free song to replace a temp track and where information related to the user's demographic or previous searches is/are irrelevant.

In FIG. 11, the system uses a semantic representation (i.e. the embedded file vector V_(FILE)) to generate a set of candidates that are then made the subject of a secondary search that refines the results and further improves recommendations made to the user.

In contrast with the approach in FIG. 10, the user 1002 registers with a service, e.g. through a server-supported log-in procedure. During this process, the user 1002 inputs user data 1102 relating to independent user attributes that are not related to consumed content. These user attributes may include demographic data, such as age and gender to name a couple of well-known exemplary attributes. These attributes are stored within the system and typically on the server side although they could also be cached locally and uploaded at user log-in for use by system intelligence. The registration process permits the server to identify the frequency of user log-in, as well as geo-location of the user (through the domain) and geo-location of accessed content, watch time of any user-accessed content, historical purchase of viewed content. The registration process thus permits a profile to be established, including a consumption history of types of data files, e.g. genres for films or music, streamed TV programs, downloaded books, etc. Profile acquisition is well-known to the skilled addressee and further explanation is not required for the sake of understanding of the invention. User data may be acquired from a third party database of a platform to which the user has already registered.

In this aspect, recommendation can be made proactively by the system intelligence (whether local or disparate to the user) rather than in response to a direct uploaded user query.

The system intelligence again, ultimately, is configured to provide a list of recommended consumable files to the user, albeit in a two-stage selection process. To do this, the system initially identifies a recorded user consumption history 1104 of N consumed files. Each of those N consumed files has an associated/embedded file vector V_(FILE) generated by the trained neural network of FIG. 2 and known (e.g. pre-stored) or calculable by the system intelligence. In other words, each of the N historically consumed files has a semantic representation reflected in its own manipulable property vector. For a set of N consumed files, where N≥1, the system intelligence selects M semantically similar files to each of the N consumed files. The processing intelligence operates a selection that is assessed by the relative quantitative distance measured from each of the file vectors V_(FILE) of each of the consumed files relative to all other reference file vectors V_(FILE) for deliverable, i.e. recommended, candidate files stored in the database. The system intelligence thus generates a set of M*N candidate files (where 1≤M≤N) as initial pools of candidate files for recommendation to the user. For each of the N sets of M files, ordering relative to each consumed file generates a list of semantically similar files that are preferably ordered from close to distant. Ordering may be based on an ordering of absolute relative distance measures across all M*N files identified in the process, or ordering may be in groups around each of the historically-consumed files (reflecting the fact that semantic closeness may tie to different file qualities, e.g. tonality and timbre in exemplary musical instances).
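The first stage of the selection, in which M semantically similar files are identified for each of the N consumed files, may be sketched as follows. The function name candidate_pool, the dictionary of file vectors and the exclusion of already-consumed files from the candidates are assumptions for illustration.

```python
import numpy as np

def candidate_pool(consumed_ids, vectors, m):
    """For each consumed file, list the m closest non-consumed files.

    `vectors` maps file identifier -> file vector. Returns a mapping from
    each consumed identifier to its candidates, ordered close to distant.
    """
    pool = {}
    for cid in consumed_ids:
        # Excluding consumed files is an assumption of this sketch
        others = [fid for fid in vectors if fid not in consumed_ids]
        others.sort(key=lambda fid: float(np.linalg.norm(vectors[fid] - vectors[cid])))
        pool[cid] = others[:m]
    return pool
```

The result is grouped around each historically-consumed file; flattening and re-sorting the groups would give the alternative ordering across all M*N files.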

The process does not need to be triggered by a specific user query (such as the provision of a file to the system intelligence) but generally the process will be triggered by a user interaction.

These M*N candidate files provide a first input to a pretrained predictive model, reinforcement learning “RL” algorithm or heuristic processing function 1110 (collectively referred to as a “predictor”).

As a second input to the predictor 1110, the system intelligence applies stored user data acquired from the registration or log-in processes. As a third input to the predictor 1110, media information is applied, which media information relates to explanatory descriptors related and cross-referenced to the content, e.g. composer, author or director detail, production or distribution date, genre (for audio or film files). Other media data will be readily appreciated and can take many forms of descriptor, including hashtags, download rates or absolute download numbers and global feedback from the system's user-base.

The predictor 1110 is arranged to calculate a score 1112 based on interactions with the user over time and, particularly, through varying learned weighting applied across the three inputs with time. The process of learning the weighting that is to be applied to the three inputs to the predictor to reflect user preferences is a process known to the skilled addressee, e.g. the contextual bandit algorithm (Li, Lihong, et al. “A contextual-bandit approach to personalized news article recommendation.” Proceedings of the 19th international conference on World wide web. 2010). Numeric ordering of the scores for the predictor 1110, having regard to weighting of the three inputs, provides a refined listing of k semantically close and relevant candidate files for output as a recommendation list 1030 to the user 1002 (where k<N*M). The recommendation list 1030 may be data files for direct instantiation or review at the user's device or otherwise a link to the data files. The recommendation list represents a refined and more accurate reflection of relevant materials because the process technically reduces the semantic gap by making use of manipulable file vectors that code information in a fashion that aligns an objectively assessable property-based vector with semantic reality.

In summary of FIG. 11, semantic similarity is assumed to be one of several factors that will determine the relevance of a recommendation to a user. Other factors may include user demographics, geo-location of the media item and the user, media item popularity (i.e. the number of views or user-reported likes), media item meta-data (i.e. hashtags), previous user feedback for media item (i.e. watch time, purchase) among many other options. Semantic representations of the N previously consumed items are used to identify M semantically similar items for each. The resulting N*M items represent candidates. A pre-trained predictive model, reinforcement learning agent or heuristic can then be used to determine a score for each candidate, taking into account all available user and media item information, where the latter may include the semantic representation. Finally, the system intelligence functions to recommend k media items based on the predicted scores, where k<N*M.
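The second stage, in which a score is determined for each candidate and the k best are recommended, may be sketched with a simple heuristic predictor. The linear weighted sum, the function name score_candidates and the field names semantic_similarity, user_match and media_signal are all hypothetical; a learned model or reinforcement learning agent could take the place of the fixed weights.

```python
def score_candidates(candidates, weights, k):
    """Rank candidates by a weighted sum of three inputs and return the top k.

    Each candidate is a dict with an "id" and three numeric signals:
    "semantic_similarity", "user_match" and "media_signal" (hypothetical names).
    """
    def score(c):
        return (
            weights["semantic"] * c["semantic_similarity"]
            + weights["user"] * c["user_match"]
            + weights["media"] * c["media_signal"]
        )
    ranked = sorted(candidates, key=score, reverse=True)
    return [c["id"] for c in ranked[:k]]
```

Changing the weights over time, as user interactions are observed, changes which candidates reach the recommendation list.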

User data in the various applications described in the context of FIGS. 11 and 12 and elsewhere herein relates to data not derived from file content.

User queries may take the form of words, video, images, speech or just a request, e.g. ‘give me a continuous flow of semantically relevant files for consumption’. Supply of recommendation based on little user direction may be based on observed user interactions with the system, e.g. recorded file download or watching behaviours acquired from the specific user's consumption history.

Referring now to FIG. 12, its approach is not dissimilar to that described above in FIG. 11 because the semantic similarity is treated as one of several factors that will determine the relevance of a recommendation to a user. However, in this solution, the set of candidates is either the entire reference database, or some form of heuristic pre-selection is applied (i.e. based on novelty or a random selection of candidate files). Again, a user profile is established and user data 1102 stored in the system.

In terms of the inputs considered by the predictor 1212, additional media information from the consumption history may include specific user-selected preferences. For example, the additional media information may include:

-   in music: genre and tonality, artist, production year and instrumentation, label, etc.
-   in video: genre, picture aspect ratios, the nature of the programme (e.g. documentary or film or trailer), director, actor, the original form of broadcast (e.g. TV or other recorded live performance),
-   for images: the nature of the photograph (e.g. sports or countryside), the pixel count, colour or black and white, hashtags and the creator identities,
-   for text, such as medical records: authors and creation date.

User feedback can include user ratings, user interactions (such as file sharing and messaging) and timed observation by the system of user interaction with a previous file. In the latter respect, if a user skips over a file, then this is indicative that the file is not relevant to the user or the user's mood at that time.

In FIG. 12, the predictor again predicts a score for each candidate by considering available information about the candidate (which includes its semantic representation in the form of the file vector), the user and the user's consumption history (which again includes semantic representations in file vector form). Finally, k media items are recommended to the user 1002 based on predicted scores from the predictor.

Referring to FIG. 13 and FIG. 14, this approach to recommendation is based on “cold start” where the system intelligence has no or very limited information concerning a new user or new file. In this respect, existing predictive models commonly lack accuracy when there is little or no data available for either a new user who has just subscribed to a new social media platform or a new specific media item, e.g. when a media item has only recently been added to a library and has not yet been consumed and/or rated by a sufficient number of users.

The media item cold start scenario is shown in FIG. 13. For the exemplary case of a new video, there is no background knowledge to allow recommendation of the new video using the predictor 1306. Each time the system intelligence is requested to make a recommendation, with probability p the system implements a purely content-based approach that recommends, from a pool of videos for which little user feedback is available, a new video which is semantically similar (assessed using its file vector) to videos previously consumed by the user. With probability (1-p), the predictor generates a recommendation list from a pool of videos for which a high amount of user data exists. Each time the content-based approach is probabilistically selected, it will generate a new data point that can be interpreted by the predictor in the next recommendation cycle to improve the predictor's performance.
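The probabilistic choice between the two recommendation paths may be sketched as a single random draw per request. The function name choose_strategy and the returned labels are hypothetical.

```python
import random

def choose_strategy(p, rng):
    """Select the content-based path with probability p, else the predictor."""
    return "content_based" if rng.random() < p else "predictor"
```

For example, with p = 0.3 roughly three requests in ten would be served by the content-based path, each generating a new data point for the predictor.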

Referring to FIG. 14 and its applicability to a user cold start, the system intelligence deals with a request, from a user, for a recommendation 1402. The system is arranged to implement a probabilistic approach. The system has available to it, in the reference database, files having related file vectors numerically indicative of semantic qualities of the file. In terms of a start, the system intelligence identifies recently consumed files, e.g. media files that have a high hit rate and play duration from within the universe of users of the system. A recommendation listing is therefore initially provided to the user based on a purely content-driven approach 1404 in which recommendations can flow from files that are semantically similar to the current “hot” topical files viewed/downloaded by users. The content-driven approach can be terminated at a point when a predetermined threshold of data points has been reached, e.g. the user has been presented with a set number of data files and has consumed these. The content-driven approach may have an exploration component that injects random files into candidate files in order to generate a response in the predictor and thus to avoid being caught in a limited semantic environment.

With increasing acquired knowledge from specific user data, a predictor 1406 presents an alternative path to effective recommendation. The predictor 1406 is arranged, or increasingly is able, to resolve, in response to the request for a recommendation 1402, acquired user data and a set of candidate files that are ordered, again, based on semantic distance as reflected in relative contrasting of multiple file vectors V_(FILE). User data may further include, but is not limited to, geo-location data for the user or the origin of a consumed file.

The system intelligence behind FIG. 14 therefore treats the alternate but potentially complementary contributions through (a) the purely content-based approach 1404 and (b) predictor 1406 as weighted probabilistically differently over time, e.g. after L file items have been consumed by the user. Initially, the probability p of making a user-sensible recommendation 1408 following a cold start favours a higher reliance, weighting and use of the purely content-based approach 1404. At this point, p may take a value of 1 so that the recommendation is entirely content based at the outset; this leaves the predictor with the ability to further research and evaluate the dataset with a view to influencing the recommendation when sufficient user consumption provides a meaningful and reliable contribution to recommendation. After time and with the acquisition of specific and sufficient user data, processing and recommendation from an output from the predictor [which functions predictively or on a collaborative filtering basis] is recognised by the system as more relevant. Thus, the system intelligence is arranged to value the probability (1-p) of the recommendation being influenced by the predictor's assessment as more dominant with time. The value of p can be set to zero to force use of a hard threshold.

In sum, the system intelligence supporting the process shown in FIG. 14 may be arranged, for example, at each new enquiry from a newly established user account to decide whether to recommend a “new” file (e.g. a media item) based on semantic similarity with previously consumed items with a probability of p or to use a predictive or collaborative filtering based mechanism to recommend an item among a set of candidates for which sufficient data is available with probability 1-p (where 0<p<1).
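As a hedged illustration of how the weighting p might vary with user consumption, the following sketch gives one possible schedule. The linear decay, the function name `content_probability` and the parameter names are assumptions chosen for simplicity; the description requires only that p starts high (possibly at 1) and that the predictor becomes more dominant with time, optionally through a hard threshold after L consumed items.

```python
def content_probability(items_consumed, threshold_l, hard=False):
    """Return p, the probability of using the content-based approach.

    items_consumed: number of files the user has consumed so far.
    threshold_l:    the number L of consumed items after which the
                    predictor is fully trusted (hypothetical parameter).
    hard:           if True, p switches from 1 to 0 at the threshold;
                    otherwise p decays linearly (an assumed schedule).
    """
    if threshold_l <= 0:
        raise ValueError("threshold_l must be positive")
    if hard:
        return 1.0 if items_consumed < threshold_l else 0.0
    return max(0.0, 1.0 - items_consumed / threshold_l)


for n in (0, 5, 10, 20, 30):
    print(n, content_probability(n, 20), content_probability(n, 20, hard=True))
```

Other monotonically decreasing schedules (for example, an exponential decay) would serve equally well; the choice is a design decision outside the scope of this sketch.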

Turning now to FIG. 15, the basic architecture resembles that already described in FIG. 10. A query, such as a data file 1502, is communicated from a user device (not shown) through a trained neural network 1504 that extracts properties to generate a vector representation that maintains a semantic descriptor with a corresponding vector expressed in property space. The system includes a database 1506 of reference vectors, although this database 1506 also cross references the vectors to textual descriptions. Consistently, the processing intelligence of the system is arranged to identify semantically close candidate files (shown as triangles in FIG. 15) relative to the file vector associated with the query (shown as a circle in FIG. 15). Again, vector comparison leads to an ability to order semantically close files based on identified distances, with greater distances between vectors indicating a greater level of dissimilarity in the content. The system intelligence then makes a recommendation, although this may now be a single output or a message that is communicated to the user who generated the query and/or to a third party, including a secure computer at a government agency. Textual descriptions, as will be understood, are present within the original training set since these are subject to NLP to generate the semantic vectors in order to assess pairwise distance and have these maintained relative to pairwise distances for extracted properties for each pair of training files.

FIG. 15 operates on the premise that the ANN is able to code the query presented as a file into a vector that aligns to other semantic representations and particularly textual description in the database.

At a first step, the algorithm of FIG. 7 computes the embedding of the new query in the semantic space and subsequently detects, typically using Euclidean distances, a number of close neighbour candidate files for which the computed Euclidean distance does not exceed a predefined threshold. At a second step, one or more textual descriptions of the candidate files are assembled based on the file vector generated for the query and a cross-referencing in the database. Retrieved textual descriptions for the candidate files, given the vectorial closeness, will generally describe in different and possibly complementary ways the object of the new query.

The processing intelligence, in response to the textual descriptions of near-neighbour candidate files, applies natural language processing techniques, such as off-the-shelf summarisation algorithms, to generate one representative composite textual description from all the retrieved descriptions associated with the candidate files within the threshold distance. This composite textual description is communicated from the server-side system intelligence to either the originator of the query or a third-party gatekeeper for a social media platform, including judicial authorities.
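The threshold-based neighbour search and the retrieval of cross-referenced descriptions may, purely as an illustrative sketch, be expressed as follows. The function name `neighbours_within` and the in-memory dictionary standing in for the database are assumptions; the summarisation step is deliberately omitted because it would rely on an off-the-shelf algorithm that is not specified here.

```python
import math


def neighbours_within(query_vec, reference, threshold):
    """Return reference entries whose file vector lies within `threshold`
    (Euclidean distance) of the query vector, nearest first.

    reference: {file_id: (file_vector, textual_description)}, a stand-in
               for the database of reference vectors (assumed layout).
    """
    hits = []
    for file_id, (vec, description) in reference.items():
        distance = math.dist(query_vec, vec)
        if distance <= threshold:
            hits.append((distance, file_id, description))
    hits.sort()  # nearest neighbour first
    return hits


reference_db = {
    "ref1": ((0.0, 0.0), "description one"),
    "ref2": ((3.0, 4.0), "description two"),
    "ref3": ((0.0, 0.5), "description three"),
}
for distance, file_id, text in neighbours_within((0.0, 0.0), reference_db, 1.0):
    print(file_id, distance, text)
```

The descriptions returned by such a search would then be passed to the summarisation stage to form the composite textual description.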

An important application context of this setup is the detection of illegal content, i.e. inappropriate images or videos that could be published on social media platforms. Here, a media item would not only be flagged as potentially being illegal based on the annotations of semantically close reference files, but the generated description would provide hints towards the type of violation of user guidelines, e.g. “porn”, and/or legally restricted content of a violent or perverse nature. Any generated description and file related to the original query can be handed to authorities if there is a suspicion of the content being illegal. Such a system can catch illegal content early on and avoids exposing content moderators, at the social media site, to such illegal content, especially offensive images. The system intelligence, in fact, acts as a filter that immediately stops content from being uploaded and stored based on a semantic evaluation of the query and the generation of a file vector that can be assessed against known file vectors. In fact, the generation of the vector for the query and the comparison with other vectors obviates the need to review or even store anywhere the specific content of what would be neighbouring candidate files. It is simply sufficient for the system to have a reference to vectors and a brief textual description or even just a warning code for the system filter to be applied.

Referring now to FIG. 16, the vector generation scheme described herein permits a progressive transition in a playlist between a source data file and an end data file along a logical path in which transitions between consecutive files are perceptually acceptable. In this context, as a musical example, a transition from a heavy metal song to a piece of choral music may be achieved in a sensible fashion through intermediate steps that ensure that transitions and semantic distances between adjacent files are within a threshold distance and that the direction of travel is towards (and not away from) the end data file. In this respect, vector distances relative to the end point are assessed at each step in generating the stepwise plan.

In terms of a further application context, playlisting aims at grouping media items with similar characteristics. However, real-world DJs and VJs often pass through several genres in a single session, while maintaining a “smooth” transition between successive items. Using the arrangement of a set of media items in semantic space, such behaviour can be simulated algorithmically by specifying a succession of source and target files. Standard graph theory methods can be applied to determine a path from source to target via other media items under a set of pre-specified conditions, including the number of media items in the path or a time for transition between source and end points. If the path finding mechanism of the system intelligence is configured such that the Euclidean distance in semantic space between successive items is minimised under the applied restrictions, improved “smoothness” in automatically determined transitions between media items is achieved.
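One standard graph method that fits this description is Dijkstra's shortest-path algorithm applied to a graph in which two media items are connected only if their file vectors lie within a maximum step distance. The sketch below is an assumption-laden example of that idea, not the specific mechanism of FIG. 16: the function name `smooth_path`, the `max_step` parameter and the choice of total path length as the cost are all introduced for illustration.

```python
import heapq
import math


def smooth_path(vectors, source, target, max_step):
    """Find a playlist path from source to target in which every hop is at
    most `max_step` in Euclidean distance, minimising total path length.

    vectors: {item_id: file_vector}. Returns a list of item ids, or None
    if no path satisfies the step restriction.
    """
    best = {source: 0.0}
    previous = {}
    visited = set()
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for other, vec in vectors.items():
            if other in visited:
                continue
            step = math.dist(vectors[node], vec)
            if step > max_step:
                continue  # hop too abrupt to be perceptually acceptable
            new_cost = cost + step
            if new_cost < best.get(other, math.inf):
                best[other] = new_cost
                previous[other] = node
                heapq.heappush(heap, (new_cost, other))
    if target not in best:
        return None
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1]


library = {
    "metal": (0.0, 0.0),
    "rock": (1.0, 0.0),
    "folk": (2.0, 0.0),
    "choral": (3.0, 0.0),
    "noise": (0.0, 5.0),
}
print(smooth_path(library, "metal", "choral", max_step=1.5))
```

Because a direct hop is always the shortest route in Euclidean space, it is the step restriction that forces the path through intermediate items; further conditions such as a fixed number of items would require a modified search.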

The embodiments described in relation, particularly, to FIGS. 10 to 16 (although applicable to the entirety of the embodiments and aspects described herein) address the problem of closing the semantic gap in a way that permits reliable correlation of semantic perception to output from a trained neural network, and especially to eliminating erroneous recommendation or improving accuracy in recommendation from the outset in presented results over which an end user has no ability to exercise control, to validate or to even appreciate. The system and methodology create a continuous multi-dimensional space in which there are measurable values attributed to semantic distance but measured in property space, with these distances defined for each individual file and measured from the perspective of each individual file of a multitude of files in an extensive database of files. The resultant vectors for all the files provide a description of the database as a whole irrespective of the point of perspective taken as a start point within the database. As such, irrespective of the size of a set of selected files of the database, the file vectors describe the relationships between files irrespective of perceived closeness or perceived dissimilarity. This is not the same as simply assigning a nominal tag to a file which is the best that a human programmer can hope to accomplish for a collection of data files. The file vectors permit an appreciation of how files having two, three, ten or hundreds of different qualities can be mapped relative to each other and weighted according to their impact on perceived semantic similarity.
In visualising the problem addressed by the invention in just three dimensions, it becomes quickly apparent that it is simply impossible to appreciate the relative relationships [represented in the diagram below as line distances between data files represented as circles and in which diagram only one such relationship is shown to five data points] between, say, ten data files (each having multiple qualities) in three dimensional space let alone fifty, one hundred, one thousand or typically millions of files in multi-dimensional space as stored in a database of a social media platform.

Unless specific arrangements are mutually exclusive with one another, the various embodiments described herein can be combined to enhance system functionality and/or to produce complementary functions or systems that support the effective identification of user-perceivable similarities and dissimilarities. Such combinations will be readily appreciated by the skilled addressee given the totality of the foregoing description. Likewise, aspects of the preferred embodiments may be implemented in standalone arrangements where more limited functional arrangements are appropriate. Indeed, it will be understood that unless features in the particular preferred embodiments are expressly identified as incompatible with one another or the surrounding context implies that they are mutually exclusive and not readily combinable in a complementary and/or supportive sense, the totality of this disclosure contemplates and envisions that specific features of those complementary embodiments can be selectively combined to provide one or more comprehensive, but slightly different, technical solutions. In terms of the suggested process flows of the accompanying drawings, it may be that these can be varied in terms of the precise points of execution for steps within the process so long as the overall effect or re-ordering achieves the same objective end results or important intermediate results that allow advancement to the next logical step. The flow processes are therefore logical in nature rather than absolute. The functional architectures of, for example, FIGS. 10 to 16 may be implemented independently of one another, as will be understood.

Aspects of the present invention may be provided in a downloadable form or otherwise on a computer readable medium, such as a CD ROM, that contains program code that, when instantiated, executes the link embedding functionality at a web-server or the like.

It will, of course, be appreciated that the above description has been given by way of example only and that modifications in detail may be made within the scope of the present invention. For example, the principle by which the neural network is trained and how semantically-assessed qualities, indicated by scaled distances, in a semantic vector space can be mapped to an objectively-generated (typically Euclidean) vector in property space can be applied to multiple forms of searchable data, including audio, visual and/or film, literature and scientific reports (such as medical reports requiring cross-referencing for trend analysis).

Qualities that may be extracted for such different source data include brightness, contrast, colour, intensity and shape and relative size as well as relative feature position and rate of change in some or all of these properties. Other measurable qualities exist for such files, including word-frequency (for text analysis) or motion-related measurements (derived from sensors), so the above is provided as a non-limiting example of how a property space can be populated with meaningful vectors [in property space] that can be contrasted with and aligned to those presented in semantic space. For an image or video, the entirety of the pixelated image or a succession of frames could be used to correspond to musical “texture”, with all pixels in the sampled image providing a two-dimensional matrix for convolutional processing. Indeed, it will be appreciated that there is a degree of overlap between the properties derivable from static images or video and music modality, as described in the detailed but exemplary embodiments above.

In terms of the process and particularly the training mechanism, it will be understood that a quality of a signal may be defined by a single property and that, consequently, the corresponding property vector is simplified in terms of its component parts. For example, the ANN may be presented with raw data like a raw waveform or spectrogram. This approach requires greater processing power because there are many more extractable data points to consider in the input. This also has an implication for the architecture of FIG. 2, namely that for each track there is simply a feature extractor (such as element 132) that feeds into only one vector in the connected neural network (such as NN 310) with no cross-linking to parallel neural networks (which are not needed). For the two tracks 302-304, the respective outputs of the neural network would be (taking FIG. 2 as the example and ignoring the quality that is assigned thereto for the exemplary context of music processing) OR₁ and OR₂. The output generated by the neural network for each path is therefore not a concatenation of different contributing components from different processing branches, but rather just a single multi-dimensional output from a single processing branch tasked with processing the input data.

FIG. 2 shows parallel branches feeding parallel ANNs 310-316, although it is possible to feed the respective different tracks one after the other through a single branch to generate, firstly, OR₁ and then, secondly, OR₂. This decreases granularity in the number of properties by grouping together related properties to define a more general property, e.g. instead of distinguishing an exemplary musical file into timbre, tonality, texture and rhythm, an alternative embodiment may use a single global property called “musicality” that encompasses all these signal qualities. This approach is particularly relevant in certain contexts, e.g. text, where a quality may simply be the frequency of a key word or the modal value of a key word within a defined length of text. The ANN arrangement of the preferred embodiments (e.g. FIG. 2) therefore processes multiple measurable qualities all assigned to a single property. In this single path arrangement, the single path/branch may be a standard ANN or a convolutional network that processes either raw data or pre-processed data, such as presented in a spectrogram, and irrespective of whether the underlying data is music, video, text, speech or image data. A standard ANN is also known as a feedforward ANN.

The distance comparator function 330 in FIG. 2 thus compares the vectors OR₁ and OR₂.

Whilst the preferred embodiment makes use of pairwise comparison, an alternate embodiment may use more than two input files and apply an optimization process in which a loss function is based on comparative distance between the two or more inputs to a reference. This means that, in the context of FIG. 2, there would be (for example) a third track input in parallel with inputs 302 and 304, with the track input appropriately linked to the one or more branches of the neural network (depending on the definition of the quality and the number of properties being assessed), with an appropriately concatenated output vector 350, 352 and a third output vector compared in a multi-input (i.e. three or more input) distance comparator 330. For example, for three input files, the distance comparator could be arranged to evaluate a triplet loss function in which the desired objective is that a first item close in semantic space exhibits a small difference/distance to a reference item and, at the same time, a second item exhibits an extreme/larger/largest distance relative to the reference item in the context of semantic space. It will be understood that the triplet loss can be considered as two consecutive pairwise comparisons between data A and data B, and then the pairwise comparison of the assessed distance between A and B then resolved in distance with third data C.
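For orientation, a commonly used hinge form of the triplet loss is sketched below. This is offered as one conventional formulation under stated assumptions rather than as the loss function of the embodiments: the `margin` parameter, the function name `triplet_loss` and the use of Euclidean distance on the output vectors are introduced for this example only.

```python
import math


def triplet_loss(reference, close_item, far_item, margin=1.0):
    """Hinge-style triplet loss on three output vectors.

    The loss is zero once the semantically close item sits nearer to the
    reference than the semantically distant item by at least `margin`
    (an assumed hyperparameter).
    """
    d_close = math.dist(reference, close_item)
    d_far = math.dist(reference, far_item)
    return max(0.0, d_close - d_far + margin)


# Well-separated triplet: no penalty.
print(triplet_loss((0.0, 0.0), (0.0, 1.0), (0.0, 5.0)))
# Badly ordered triplet: the "close" item is further away than the "far" one.
print(triplet_loss((0.0, 0.0), (0.0, 3.0), (0.0, 1.0)))
```

During training, the gradient of such a loss would pull the close item towards the reference and push the distant item away, which mirrors the stated objective for the three-input distance comparator.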

In the context of image processing, as explained above, different properties and qualities are measured and used to train the system. As indicated above, expressing similarity between images (whether in the training sequence or afterwards in an active AI environment) generally relies on properties different to those in music. Tonality and texture (or how each is used in the context of images and music) are an exception, and their use is different in these media domains.

For image processing, including static pictures and video inputs, embodiments can be based on one or more of the following considerations:

a) For the property “Texture”, measurable signal qualities include values for coarseness, presence of spots/dots, regularity, directionality and so on. Common standard descriptors include the Texture Browsing Descriptor, the HTD and the Edge Histogram Descriptor. Each such descriptor is computed by a standard algorithm and may consist of one or more numbers. These descriptors correspond to “measurable signal qualities” as expressed herein. These descriptors may be stacked into a single numerical vector that represents the texture of the image as a whole. Therefore, a subnetwork of the described neural architecture can be devoted to texture processing, much like a respective subnetwork is devoted to processing of timbre (in the sense of the exemplary context of music processing also described above).

b) For the property “Colour”, its importance for visual understanding generally warrants a number of colour space descriptors to be extracted, e.g. a colour histogram descriptor, a dominant colour descriptor, and a colour layout descriptor. Other descriptors will be readily understood by those skilled in the image processing arts, such as those described by B. S. Manjunath, Jens-Rainer Ohm, Vinod V. Vasudevan, and Akio Yamada in “Color and Texture Descriptors”, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 11, NO. 6, JUNE 2001. Each such colour descriptor may be realised by one or more numerical value(s) that capture certain image properties, including but not limited to spatial dispersion of particularly image-relevant dominant colours across a portion or totality of an image region. The colour descriptors can therefore form another numerical vector. Therefore, a subnetwork of the described neural architecture of the various embodiments can be devoted to colour processing.

c) For the property “Presence of Objects in an Image”, a neural network architecture may already exist that identifies objects, such as faces, vehicles, clothing, etc., and is thus already trained for object detection/classification purposes in images. One such network is the Inception architecture described by Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alexander A. Alemi, “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning”, Thirty-First AAAI Conference on Artificial Intelligence, 2017. Such networks operate to extract “bottleneck” descriptors for an image at hand. These bottleneck descriptors are long numerical vectors that are usually extracted from the penultimate layer of an appropriate neural network architecture when an image is given as input and are considered to capture image properties related to the presence/absence of a large number of object classes. Again, this type of numerical vector can be fed to a subnetwork of the architecture of the embodiments described herein.

Rather than making use of data from existing trained networks or making use of predefined hand-crafted features [in the context of image interpretation], raw pixel values may be used as direct input into a convolutional ANN (in a similar fashion to texture in the exemplary music application described above), or the data can be flattened (i.e. converted to an unfolded numerical vector, that is, a vector resulting from unfolding multiple dimensions to one) and applied to a standard ANN. The raw pixel values may have a 2D structure in the case of grey-scale images, or a 3D structured input in the case of RGB images. Use of raw pixel data can supplement the properties of image texture, colour and object presence.
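The flattening operation referred to above can be illustrated with a short sketch. The nested-list image representation and the function name `flatten_pixels` are assumptions for this example; a practical system would more likely rely on an array library.

```python
def flatten_pixels(image):
    """Unfold a 2D grey-scale image (rows of numbers) or a 3D RGB image
    (rows of (r, g, b) triples) into a single one-dimensional vector."""
    flat = []
    for row in image:
        for pixel in row:
            if isinstance(pixel, (list, tuple)):
                flat.extend(pixel)  # RGB: append each channel in turn
            else:
                flat.append(pixel)  # grey-scale: single intensity value
    return flat


grey = [[0, 255], [128, 64]]
rgb = [[(255, 0, 0), (0, 255, 0)]]
print(flatten_pixels(grey))
print(flatten_pixels(rgb))
```

The resulting vector could then be supplied to a standard (feedforward) ANN, whereas the un-flattened structure would be retained for a convolutional branch.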

With video processing, data may further include temporal considerations where a feature evolves with time. This may affect the choice of the ANN and require the use of temporal models/recurrent architectures, such as long short-term memory “LSTM”. Rather than a conventional convolutional branch, convolutional layers may be time-distributed using widely-known techniques. Any modality that has a time component, including music, may also make use of this property contribution in the context of a user-definable quality feature. The user-definable quality feature may be any form of time-series data, including waveforms and sensor-generated data.

A further application of the embodiments of the present invention is in the field of speech processing.

Speech pathology detection refers to the problem of classifying a given audio recording into one of a set of classes of speech pathology, like dysphonia, phonotrauma, laryngeal neoplasm and vocal paralysis, or deciding in favour of the absence of pathological symptoms.

To that end, a further practical application of the embodiments of the invention can be based, for example, on a database of speech recordings for which medical descriptions are available in textual form and where such textual descriptions might describe the presence or absence of specific speech pathologies in recorded speech. A textual description can emphasize the difficulty a person has in pronouncing certain consonants or that their speech contains an unusual number of short pauses. In addition, cross-referenced medical records can include categorical data related to a person's gender, age, education, profession and so on.

This embodiment extracts measurable qualities from the speech recordings and, as appropriate or desired, groups these measurements together to define one or more properties for the speech recording(s). For example, speech qualities include (as will be understood and amongst other measurable qualities) pitch frequency [sometimes referred to as the “fundamental frequency”] and formant frequencies of the speaker. Furthermore, data from medical records, if available, can be used as another source of data qualities, as suggested by Chitralekha Bhat and Sunil Kumar Kopparapu, “FEMH Voice Data Challenge: Voice disorder Detection and Classification using Acoustic Descriptors”, 2018 IEEE International Conference on Big Data, to be processed in property space. As will be appreciated, there are many other user-definable properties that can be assembled from measurable qualities of an input signal, so pitch and formant frequencies are just exemplary of possible qualities.

Some or all of the aforementioned speech qualities (or other speech qualities) can be used as input to a single neural network or they can feed separate branches (sometimes interchangeably referred to as sub-networks) depending on a user-adopted definition of properties selected from, and defined by, one or more of the measurable qualities. For example, all measurable parameters, i.e. measurable qualities, stemming from a patient's medical records can be grouped together as a first property and processed by one branch, whereas audio features measuring qualities of the recorded speech can be grouped and processed appropriately by one or more branches of the ANN.

The semantic space for the speech file is obtained, again, from a subjectively prepared description of the pathology. For example, a written description of this pathology would then be subject to NLP to generate a corresponding vector in semantic space.

With property input and semantic input now assembled from the above data acquisition processes, training of the network is again undertaken using the described backpropagation process that values semantic perception reflected in quantified semantic dissimilarity distance measures over property assessment reflected by the distance measure between the first multi-dimensional property vector and the second multi-dimensional property vector and such that the ANN maps pairwise similarity/dissimilarity in property space towards corresponding pairwise semantic similarity/dissimilarity in semantic space.
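The training objective just described, in which pairwise distance in property space is driven towards the corresponding semantic distance, can be pictured with a simple per-pair error term. The squared-difference form below is an assumption chosen for clarity and is not asserted to be the exact loss used in the described backpropagation; the function name `pairwise_alignment_loss` is likewise hypothetical.

```python
import math


def pairwise_alignment_loss(prop_vec_a, prop_vec_b, semantic_distance):
    """Squared error between the property-space distance of two network
    outputs and the target semantic distance for the same pair of files.

    Minimising this term moves the property-space distance towards the
    semantic distance (assumed loss form, for illustration only).
    """
    property_distance = math.dist(prop_vec_a, prop_vec_b)
    return (property_distance - semantic_distance) ** 2


# Property distance is 5.0; a semantic target of 5.0 gives zero loss.
print(pairwise_alignment_loss((0.0, 0.0), (3.0, 4.0), 5.0))
# A semantic target of 2.0 leaves a residual to be trained away.
print(pairwise_alignment_loss((0.0, 0.0), (3.0, 4.0), 2.0))
```

In a trainable implementation, the two property vectors would be outputs of the network for the two files of a training pair, and the error would be backpropagated through the network weights.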

The embodiments thus create a space of embeddings by training the neural network architecture with the proposed backpropagation method. Given a new recording for which a decision is required as to whether a predefined pathology is present or absent, the approach described above produces the embedding of the recording via the trained network and a decision on the nature of the recording can be based on k nearest neighbours in the embedding space.
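A minimal sketch of a k nearest neighbours decision in the embedding space is given below. The majority-vote rule, the function name `knn_decision` and the toy two-dimensional embeddings are assumptions for illustration; real embeddings would be produced by the trained network and would typically have many more dimensions.

```python
import math
from collections import Counter


def knn_decision(query_vec, labelled, k=3):
    """Return the majority label among the k labelled embeddings nearest
    (Euclidean distance) to the query embedding.

    labelled: list of (embedding, label) pairs.
    """
    nearest = sorted(labelled, key=lambda item: math.dist(query_vec, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


recordings = [
    ((0.0, 0.0), "no pathology"),
    ((0.0, 1.0), "no pathology"),
    ((1.0, 0.0), "dysphonia"),
    ((5.0, 5.0), "dysphonia"),
    ((6.0, 5.0), "dysphonia"),
]
print(knn_decision((0.1, 0.1), recordings, k=3))
print(knn_decision((5.5, 5.0), recordings, k=2))
```

Distance-weighted voting or a per-class distance threshold could be substituted where a simple majority is considered too coarse.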

The exemplary four-property architecture described in FIG. 2 can thus be directly adapted to processing images using image texture, colour, object presence and raw pixel input. Of course, the number of processing paths is not limited to four and, in the limit, could range from one to many hundreds or more (dependent upon computing power, the complexity of the media domain and the size of the training set).

It is envisioned that processing of data may be multi-modal in that the input is not purely music, or speech or video or text but a combination of two or more of these media. In this instance, the semantic description may not change to any appreciable extent. However, the qualities and properties could extend across, and be assessed across, different domains. For example, a micro-video (i.e. a user-generated video such as uploaded to YouTube® or presented on Triller®) may have its qualities partitioned along the lines of: i) the presence of objects along video frames; ii) the spectrogram obtained from the audio signal, and iii) the textual data associated with hashtags. Of course, there are many other combinations of qualities represented by subsets of signal qualities or raw data, and many qualities might overlap and make use of different subsets of derivable properties.

1. (canceled)
2. (canceled)
3. (canceled)
4. (canceled)
5. (canceled)
6. (canceled)
7. (canceled)
8. (canceled)
9. (canceled)
10. (canceled)
11. (canceled)
12. (canceled)
13. (canceled)
14. (canceled)
15. (canceled)
16. A method of providing a file recommendation based on semantic qualities, the method comprising: identifying a recently consumed reference data file that has been consumed by a user; processing the reference data file to extract properties therefrom; calculating a first file vector in property space from said extracted properties, wherein the first file vector both preserves and is representative of semantic properties of content of the reference data file; evaluating a new data file in terms of semantic closeness to the reference data file, said evaluation based on a relative comparison between the first file vector and a different second file vector derived from properties of the new data file and where the second file vector also preserves and is representative of semantic properties of content of the new data file; determining availability and extent of at least one of (a) user data obtained for the user, and (b) property vectors in candidate file data, said property vectors reflective of semantic qualities therein; providing the file recommendation based on a probabilistic weighting between: a content-based approach of semantic closeness evaluated between the reference data file and the new data file; and a predictive approach based on one of a predictive model, a reinforcement learning “RL” algorithm or heuristic processing function, wherein the predictive approach is based on sufficiency in availability of user data and property vectors in candidate file data.
17. The method of providing a file recommendation according to claim 16, wherein the probabilistic weighting between the content-based approach and the predictive approach varies with time.
18. The method of providing a file recommendation according to claim 16, wherein initially the content-based approach is absolute.
19. (canceled)
20. (canceled)
21. (canceled)
22. (canceled)
23. (canceled)
24. (canceled)
25. (canceled)
26. (canceled)
27. (canceled)
28. (canceled)
29. (canceled)
30. (canceled)
31. (canceled)
32. (canceled)
33. (canceled)
34. (canceled)
35. A system containing processing intelligence arranged to provide a file recommendation based on semantic qualities, the processing intelligence arranged to: process a reference data file to extract properties therefrom; calculate a first file vector in property space from said extracted properties, wherein the first file vector both preserves and is representative of semantic properties of content of the reference data file; evaluate a new data file in terms of semantic closeness to the reference data file, said evaluation based on a relative comparison between the first file vector and a different second file vector derived from properties of the new data file and where the second file vector also preserves and is representative of semantic properties of content of the new data file; determine availability and extent of at least one of (a) user data obtained for the user, and (b) property vectors in candidate file data, said property vectors reflective of semantic qualities therein; provide the file recommendation based on a probabilistic weighting between: a content-based approach of semantic closeness evaluated between the reference data file and the new data file; and a predictive approach based on one of a predictive model, a reinforcement learning “RL” algorithm or heuristic processing function, wherein the predictive approach is based on sufficiency in availability of user data and property vectors in candidate file data.
 36. The system of claim 35, wherein the system intelligence is arranged to vary with time the probabilistic weighting between the content-based approach and the predictive approach.
 37. The system of claim 36, wherein the system intelligence initially makes the content-based approach absolute.
 38. The system of claim 36, wherein the system intelligence is a server-side component remotely and selectively connected to a user device over a network.
 39. The system of claim 19, wherein the system intelligence is located, at least in part, in a user device.
 40. The system of claim 35, wherein the processing intelligence is located, at least in part, in a user device.
 41. The system according to claim 35, wherein the file vector and each property vector is an output from a trained artificial neural network “ANN” that, following pairwise training of the ANN using pairs of training files, maps pairwise similarity/dissimilarity in property space towards corresponding pairwise semantic similarity/dissimilarity in semantic space to preserve semantic evaluation by valuing, on a pairwise basis, semantic perception reflected in quantified semantic dissimilarity distance measures over property assessment reflected by distance measures in property space, said quantified semantic dissimilarity distance measures.
 42. The processing system of claim 41, wherein: the ANN compares a subjectively-derived semantic vector against a property space vector, the subjectively-derived semantic vector being generated independently of the property space vector, the ANN correlating quantified semantic dissimilarity measures for the subjectively-derived semantic vector, which describes content in semantic space for each of a first data file and also a different second data file, with related property separation distances for the property space vector, which is provided in property space and which describes measurable signal quality extracted for respective content of both the first data file and the different second data file, to provide an output that is adapted, over time, to align a result in property space to a result in semantic space, and wherein the ANN is configured, during adaptation of weights in the ANN, to value semantic dissimilarity measures over measurable properties and such that the ANN is configured to map pairwise similarity/dissimilarity in property space for the first data file and the second data file towards corresponding pairwise semantic similarity/dissimilarity in semantic space for the first data file and the second data file thereby to configure a system, in identifying and quantifying similarity or dissimilarity in audio or image-based content, to output a measure of similarity between said content of said first data file relative to content in said second data file, and the subjectively-derived semantic vector is derived using natural language processing (NLP) of a text description of content for each of the first data file and the different second data file.
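
By way of non-limiting illustration only, and forming no part of the claims, the probabilistic weighting between the content-based approach and the predictive approach recited above might be sketched as follows. The sketch assumes that the weighting is realised as a sampling probability, that the count of recorded user interactions stands in for both elapsed time and sufficiency of user data, and that semantic closeness is measured by cosine distance between file vectors. The names content_weight, cosine_distance and recommend, the half_life parameter and the predictive_scores input are all hypothetical and do not appear in the specification.

```python
import math
import random


def content_weight(n_interactions, half_life=20.0):
    """Probability of choosing the content-based approach.

    Assumption: the weight is 1.0 when no user data exists (the
    content-based approach is "absolute") and halves for every
    `half_life` recorded interactions.
    """
    return math.exp(-math.log(2.0) * n_interactions / half_life)


def cosine_distance(u, v):
    """Cosine distance between two non-zero file vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def recommend(reference_vec, candidates, n_interactions,
              predictive_scores=None, rng=random):
    """Pick one candidate file and report which approach was used.

    candidates:        dict mapping file name -> file vector
    predictive_scores: dict mapping file name -> score (higher is
                       better), standing in for the output of a
                       predictive model, RL algorithm or heuristic
    rng:               any object exposing random() in [0, 1)
    """
    p_content = content_weight(n_interactions)
    # Fall back to the content-based approach when no predictive
    # output is available, or when the random draw selects it.
    if predictive_scores is None or rng.random() < p_content:
        name = min(
            candidates,
            key=lambda k: cosine_distance(reference_vec, candidates[k]),
        )
        return name, "content"
    name = max(predictive_scores, key=predictive_scores.get)
    return name, "predictive"
```

With no recorded interactions the function always returns the semantically closest candidate; as interactions accumulate, the predictive scores are selected with increasing probability. An exponential decay is only one of many possible schedules for varying the weighting.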